<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://naveenreddyvarikuti.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://naveenreddyvarikuti.github.io/" rel="alternate" type="text/html" /><updated>2026-06-14T12:21:15+00:00</updated><id>https://naveenreddyvarikuti.github.io/feed.xml</id><title type="html">Naveen Reddy</title><subtitle>Diving deep into AI research from RL for LLMs to World Models.
</subtitle><author><name>Naveen Reddy</name><email>naveenreddyvarikuti@gmail.com</email></author><entry><title type="html">Positional Encoding Explained: Sinusoidal Embeddings and RoPE (Part 2)</title><link href="https://naveenreddyvarikuti.github.io/2026/06/06/positional-encoding-sinusoidal-and-rope.html" rel="alternate" type="text/html" title="Positional Encoding Explained: Sinusoidal Embeddings and RoPE (Part 2)" /><published>2026-06-06T08:00:00+00:00</published><updated>2026-06-06T08:00:00+00:00</updated><id>https://naveenreddyvarikuti.github.io/2026/06/06/positional-encoding-sinusoidal-and-rope</id><content type="html" xml:base="https://naveenreddyvarikuti.github.io/2026/06/06/positional-encoding-sinusoidal-and-rope.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In <a href="https://naveenreddyvarikuti.github.io/2026/05/23/positional-encoding-transformers-explained.html">Part 1</a>, we explored three simple approaches to positional encoding: direct integers, normalized values, and
binary vectors.</p>

<p>The most important lesson came from binary encoding. It revealed a multi
frequency structure hidden inside position representations. Each bit
oscillates at a different rate. Low bits flip rapidly, capturing fine
grained position differences. High bits flip slowly, capturing coarse
position in the sequence.</p>

<p>The structure was right. The problem was the shape. Square waves jump
between 0 and 1 with no in between. Adjacent positions sometimes looked
very different in vector space. Neural networks need smooth inputs to
learn smooth functions.</p>

<p>The fix is to replace the square waves with smooth waves, keeping the
multi frequency structure while making it continuous.</p>

<p>In this post, we will cover two approaches:</p>

<ul>
  <li>
    <p><strong>Sinusoidal positional encoding</strong>, introduced in the original
Transformer paper (Vaswani et al., 2017). It uses sine and cosine waves
at geometrically spaced frequencies to give each position a unique,
smooth, bounded vector.</p>
  </li>
  <li>
    <p><strong>Rotary Position Embeddings (RoPE)</strong>, introduced by Su et al. (2021).
Instead of adding position to the embedding, it rotates the query and
key vectors by their position. Relative position then falls out of the
query and key dot product directly, with no learning required.</p>
  </li>
</ul>

<p>Sinusoidal encoding was a great step. But it has a structural
limitation in how it mixes position with meaning. RoPE fixes that
limitation, and is the method used in nearly every modern large language
model today, including LLaMA, Mistral, Gemma, and Phi.</p>

<p>We will build both ideas step by step. Every formula will be derived from
scratch. Every design choice will be motivated by a specific problem.</p>

<p>Let us start from exactly where Part 1 left off: the square waves of
binary encoding and the two functions that make them smooth.</p>

<h2 id="idea-4-sinusoidal-encoding">Idea 4: Sinusoidal Encoding</h2>
<h3 id="the-smoothest-periodic-function">The Smoothest Periodic Function</h3>

<p>The smoothest possible periodic function is the sine wave (or, equivalently, the cosine wave, which is the same shape shifted by a quarter cycle).</p>

<p>A sine wave does not jump. It rises and falls continuously. Adjacent points
on the wave are always close to each other in value. Two positions that
are near each other will always produce sine values that are near each
other.</p>

<p>Compare a square wave and a sine wave at the same frequency:</p>

<p><img src="/assets/animations/Positional_Encoding/sinwave_vs_squarewave.png" alt="Sin wave vs Square Wave" /></p>

<p>Both waves oscillate at the same rate. Both repeat at the same period. The
difference is how they get from low to high. The square wave jumps. The
sine wave moves smoothly.</p>

<p>Before we work out the formulas, it helps to actually see what these waves
look like in motion.</p>

<p><img src="/assets/animations/Positional_Encoding/SinusoidalSpiral.gif" alt="Sinusoidal Spiral" /></p>

<p>The animation traces out sine waves at 3 frequencies (dim 1, dim 2, dim 3). As the
position advances, the wave moves smoothly through space. There are no
sudden jumps. There are no discrete flips. Every step from one position
to the next is a continuous change.</p>

<h3 id="building-a-multi-frequency-encoding">Building a Multi Frequency Encoding</h3>

<p>If we take inspiration from binary encoding, we want multiple sine waves
stacked together. Each one oscillating at a different frequency.</p>

<p>For position $pos$ and dimension index $i$, the most natural starting point
is:</p>

\[PE(pos, i) = \sin(pos \cdot \omega_i)\]

<p>Where $\omega_i$ is the frequency for dimension $i$.</p>

<p>Different dimensions get different frequencies. Some dimensions oscillate
fast, like the LSB in binary. Others oscillate slow, like the MSB.</p>

<p>We now have to answer two questions:</p>

<ol>
  <li>How do we choose the frequencies $\omega_i$ for each dimension?</li>
  <li>Is using only sine enough, or do we also need cosine?</li>
</ol>

<p>The original Transformer paper answers both. Each one solves a specific problem.</p>

<p>Let us look at the frequency choice first.</p>

<h3 id="choosing-the-frequencies">Choosing the Frequencies</h3>

<p>We need a range of frequencies. Some should be high, so the encoding can
distinguish nearby positions sharply. Some should be low, so the encoding
can carry information across long distances without repeating.</p>

<p>The paper uses this formula for the frequency of dimension pair $i$:</p>

\[\omega_i = \frac{1}{10000^{2i/d}}\]

<p>Where $d$ is the total dimensionality of the encoding.</p>

<p>Let us see what this gives us.</p>

<p>For dimension pair $i = 0$:</p>

\[\omega_0 = \frac{1}{10000^{0}} = 1\]

<p>The wave oscillates rapidly. The value changes meaningfully with every
position. This is the “fast bit” of the encoding.</p>

<p>For the last dimension pair, $i = d/2 - 1$, the exponent $2i/d$ is just below 1:</p>

\[\omega_{d/2 - 1} = \frac{1}{10000^{(d-2)/d}} \approx \frac{1}{10000}\]

<p>The wave oscillates extremely slowly. It barely changes over thousands of
positions. This is the “slow bit” of the encoding.</p>

<p>Between these two extremes, the frequencies decrease smoothly on a
geometric scale.</p>

<p><img src="/assets/animations/Positional_Encoding/sinwaves_vs_frequencies.png" alt="Sin wave vs frequencies" /></p>

<p>This is the same multi frequency structure we saw in binary encoding. Fast
dimensions for fine grained position. Slow dimensions for coarse position.
The difference is that every wave is smooth.</p>
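<p>To make the frequency formula concrete, here is a minimal sketch that evaluates $\omega_i$ for every dimension pair. The function name <code>frequencies</code> and the tiny value $d = 8$ are assumptions chosen for readability, not something taken from the paper.</p>

```python
import math

def frequencies(d, base=10000.0):
    """Return omega_i = 1 / base**(2i/d) for pair indices i = 0 .. d/2 - 1."""
    return [1.0 / base ** (2 * i / d) for i in range(d // 2)]

d = 8  # toy dimensionality (an assumption, real models use far more)
omegas = frequencies(d)
for i, w in enumerate(omegas):
    period = 2 * math.pi / w  # positions needed for one full cycle
    print(f"pair {i}: omega = {w:.4f}, period = {period:.1f} positions")
```

<p>With $d = 8$, each frequency is one tenth of the previous one, which is the geometric spacing described above.</p>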

<h3 id="the-general-form">The general form:</h3>

\[y = \sin(\omega \cdot x)\]

<p>Where $\omega$ is the frequency. Larger $\omega$ means the wave oscillates
faster. Smaller $\omega$ means it oscillates slower.</p>

<p>To see this concretely, let us look at four sine waves with progressively
lower frequencies. Here $x$ is the position $pos$, and the factor multiplying
$x$ is the frequency $\omega$.</p>

<h4 id="sinx">$\sin(x)$</h4>

<p>This is the baseline. The wave completes one full cycle every $2\pi$
units. It oscillates rapidly.</p>

<p><img src="/assets/animations/Positional_Encoding/sin_x.png" alt="Sinx" /></p>

<h4 id="sinx10">$\sin(x/10)$</h4>

<p>Dividing the input by 10 stretches the wave horizontally by a factor of 10.
The wave still does the same thing, but it takes 10 times longer to do
each cycle. One full cycle now takes about 63 positions.</p>

<p><img src="/assets/animations/Positional_Encoding/sin_x_over_10.png" alt="Sinx/10" /></p>

<h4 id="sinx50">$\sin(x/50)$</h4>

<p>Now the wave is very slow. Across 100 positions, we see only about 0.32
cycles. The wave is starting to look like a gentle curve rather than a
rapid oscillation.</p>

<p><img src="/assets/animations/Positional_Encoding/sin_x_over_50.png" alt="Sinx/50" /></p>

<h4 id="sinx100">$\sin(x/100)$</h4>

<p>At this frequency, we complete only about 0.16 of a cycle across 100
positions. The wave rises steadily over the whole visible range, and adjacent
positions look almost identical.</p>

<p><img src="/assets/animations/Positional_Encoding/sin_x_over_100.png" alt="Sinx/100" /></p>
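<p>The cycle counts quoted for these four waves follow from one relation: a wave $\sin(\omega x)$ has period $2\pi / \omega$. A small sketch, where the helper name <code>cycles</code> is an assumption, reproduces them:</p>

```python
import math

def cycles(omega, n_positions):
    """Number of full cycles sin(omega * x) completes over n_positions."""
    return omega * n_positions / (2 * math.pi)

for divisor in (1, 10, 50, 100):
    omega = 1.0 / divisor
    period = 2 * math.pi / omega
    print(f"sin(x/{divisor}): period = {period:.1f} positions, "
          f"cycles in 100 positions = {cycles(omega, 100):.2f}")
```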

<h3 id="stacking-them-together">Stacking Them Together</h3>

<p>If we stack these four waves on the same axis, we see the spectrum in
one view.</p>

<p><img src="/assets/animations/Positional_Encoding/sin_comparison.png" alt="Stacked Sin Waves" /></p>

<p>This is the same idea as binary encoding’s stack of square waves, but
smooth. Fast waves capture fine grained position changes. Slow waves
capture coarse, sequence wide position.</p>

<p>Each wave tells the model a different thing about where a token sits.</p>

<p>This gives us the visual intuition. The actual transformer formula
controls this spectrum through the choice of base 10000. Let us see why.</p>

<h3 id="why-10000">Why 10000?</h3>

<p>The number 10000 looks arbitrary, but it is not.</p>

<p>It controls the range of frequencies. With a base of 10000, the slowest
wave completes a full cycle every $2\pi \times 10000 \approx 62{,}832$
positions. This means within any reasonable sequence length, the slow
dimensions never repeat. Every position gets a unique encoding.</p>

<p>To see this, let us plot the same encoding with three
different base values: 100, 10000, and 100 million.</p>

<p>For each base, we will look at four different dimension indices to see
how the wave behaves across positions 0 to 1000.</p>

<p><img src="/assets/animations/Positional_Encoding/why_10000_visualized.png" alt="Why the Base Matters" /></p>

<p>Each row shows one base value. Each column shows one dimension index. The
top row uses base 100. The middle row uses base 10000. The bottom row uses
base 100 million.</p>

<h3 id="base--100-top-row">Base = 100 (top row)</h3>

<p>Look at the four columns left to right.</p>

<ul>
  <li>
    <p><strong>dim 0</strong> oscillates rapidly. This is expected. Every base produces a
fast wave at dim 0.</p>
  </li>
  <li>
    <p><strong>dim 64</strong> still oscillates a lot. Across 1000 positions, you see
many full cycles. The wave is far from slow.</p>
  </li>
  <li>
    <p><strong>dim 96</strong> also oscillates clearly, with several full cycles across 1000 positions.</p>
  </li>
  <li>
    <p><strong>dim 127</strong> (the slowest dimension) still completes more than one full cycle in
1000 positions.</p>
  </li>
</ul>

<p>This is the problem with a small base. Even the slowest dimensions
oscillate within the visible range. The waves repeat, so two positions far
apart can end up with nearly identical encodings across every dimension.
The model loses its ability to tell distant positions apart.</p>

<p>A base of 100 spreads the frequency spectrum too narrowly. Everything moves too fast.</p>

<h3 id="base--10000-middle-row">Base = 10000 (middle row)</h3>

<p>This is the actual choice from the Transformer paper. Look at how the
behavior changes.</p>

<ul>
  <li>
    <p><strong>dim 0</strong> still oscillates rapidly. The fast end of the spectrum is
unchanged.</p>
  </li>
  <li>
    <p><strong>dim 64</strong> completes about 2 full cycles in 1000 positions. Slow enough
to carry meaningful long range information, fast enough to differentiate
positions.</p>
  </li>
  <li>
    <p><strong>dim 96</strong> does not even complete one cycle. The curve rises smoothly
across the entire visible range. This dimension can distinguish between
position 100 and position 800 because the values are different.</p>
  </li>
  <li>
    <p><strong>dim 127</strong> is nearly flat. The value barely changes from 0 to 0.1 across
1000 positions. This is the slowest end of the spectrum.</p>
  </li>
</ul>

<p>The spread is right. Fast dimensions stay fast. Slow dimensions actually
get slow. Every position in a 1000 token sequence gets a unique vector
across the full encoding.</p>

<h3 id="base--100-million-bottom-row">Base = 100 million (bottom row)</h3>

<p>Now look at what happens when the base is too large.</p>

<ul>
  <li>
    <p><strong>dim 0</strong> still oscillates rapidly. The fast end never changes with base.</p>
  </li>
  <li>
    <p><strong>dim 64</strong> is nearly a straight line. The wave has been stretched so far
that it barely changes across 1000 positions.</p>
  </li>
  <li>
    <p><strong>dim 96</strong> is completely flat at zero.</p>
  </li>
  <li>
    <p><strong>dim 127</strong> is also completely flat at zero.</p>
  </li>
</ul>

<p>Most of the dimensions are useless. Their values are essentially the same
across every position. They contribute no information about where a token
is.</p>

<p>A base of 100 million spreads the frequency spectrum too widely. Almost
the entire encoding is squeezed into the very slowest dimensions, which
do nothing for typical sequence lengths.</p>

<h3 id="the-problem">The Problem</h3>

<p>A smaller base, like 100, would cause the slow waves to repeat much
sooner. Different positions would start getting identical encodings.
The model would lose the ability to tell them apart.</p>

<p>A larger base, like 100 million, would spread the frequencies too thin. Most
dimensions would oscillate too slowly to be useful.</p>

<p>The value 10000 was chosen to balance these two concerns. It is large
enough to avoid repetition within typical context lengths, but not so large
that the frequency spectrum becomes useless.</p>
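<p>The trade-off can also be checked numerically. The sketch below counts how many cycles each wave completes over 1000 positions for the three bases. It assumes the plot uses $d = 128$ with the exponent taken as <code>dim / d</code>, which is a guess that matches the behavior described for the middle row. The helper names are hypothetical.</p>

```python
import math

def wave_value(pos, dim, d, base):
    """sin(pos / base**(dim/d)): the wave for one dimension index."""
    return math.sin(pos / base ** (dim / d))

def cycles_in(n_positions, dim, d, base):
    """Full cycles completed by that wave over n_positions."""
    omega = 1.0 / base ** (dim / d)
    return n_positions * omega / (2 * math.pi)

d = 128  # assumed plot setting: dims 0..127, exponent dim/d
for base in (100, 10_000, 100_000_000):
    row = ", ".join(f"dim {dim}: {cycles_in(1000, dim, d, base):.3f}"
                    for dim in (0, 64, 96, 127))
    print(f"base {base}: cycles over 1000 positions -> {row}")
```

<p>Under these assumptions, base 100 still completes more than one cycle in its slowest dimension, while base 100 million leaves dims 96 and 127 at a tiny fraction of a cycle.</p>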

<h3 id="so-far-so-good">So Far, So Good</h3>

<p>We now have a smooth multi frequency encoding. Every position gets a
unique vector. The values are bounded between -1 and 1. The transitions
between adjacent positions are continuous.</p>

<p>This is already a complete positional encoding. We could stop here and use
just sine waves.</p>

<p>But the original Transformer paper does not stop here. Half of the
dimensions use sine. The other half use cosine.</p>

<p>Why? What does cosine give us that sine alone does not?</p>

<h3 id="why-both-sine-and-cosine">Why Both Sine and Cosine?</h3>

<p>We have a working positional encoding using only sine waves. Each position
gets a vector. The values are bounded and smooth. The frequencies span from
fast to slow.</p>

<p>So why does the original Transformer paper use sine for half the dimensions
and cosine for the other half?</p>

<p>The formula assigns sine to even indexed dimensions and cosine to odd
indexed dimensions:</p>

\[PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)\]

\[PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]
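<p>The two formulas translate almost line for line into code. This is a minimal sketch in plain Python. The function name <code>sinusoidal_pe</code> is an assumption, and a real implementation would typically build the whole table at once with an array library.</p>

```python
import math

def sinusoidal_pe(pos, d, base=10000.0):
    """Interleaved encoding: even slots hold sin, odd slots hold cos."""
    pe = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

print(sinusoidal_pe(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in sinusoidal_pe(1, 4)])
```

<p>Position 0 always encodes to alternating zeros and ones, since $\sin(0) = 0$ and $\cos(0) = 1$ at every frequency.</p>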

<p>Cosine looks redundant. It is just a shifted sine wave. So
why include it?</p>

<p>The reason is a property called linear transformability.</p>

<h3 id="linear-transformability">Linear Transformability</h3>

<p>We want the encoding to have a useful structure.</p>

<p>Given the encoding of one position, we want to reach the encoding of any
other position using a simple linear operation.</p>

<p>If we know the encoding at position $pos$, we want to compute the encoding
at position $pos + k$ by just multiplying with a matrix. And the same
shift should always use the same matrix.</p>

<p>This is a strong property. If it holds, the model can reason about relative
shifts using simple linear layers. This matters because $W_q$ and $W_k$ are
exactly that, linear layers.</p>

<h3 id="the-pair-representation">The Pair Representation</h3>

<p>Take a single frequency $\theta$. Represent each position as a pair of
values, one sine and one cosine:</p>

\[PE(pos) = (\sin(pos\,\theta),\ \cos(pos\,\theta))\]

<p>Now, let us ask one question. Can we get from $PE(pos)$ to $PE(pos + k)$ using a
fixed matrix?</p>

<h3 id="we-can-transform-using-fixed-matrix">We Can Transform Using a Fixed Matrix</h3>

<p>Write down the encoding at position $pos + k$:</p>

\[PE(pos + k) = (\sin((pos + k)\theta),\ \cos((pos + k)\theta))\]

<p>Expand each term using the angle addition formulas:</p>

\[\sin((pos + k)\theta) = \sin(pos\,\theta)\cos(k\theta) + \cos(pos\,\theta)\sin(k\theta)\]

\[\cos((pos + k)\theta) = \cos(pos\,\theta)\cos(k\theta) - \sin(pos\,\theta)\sin(k\theta)\]

<p>Look at the right hand side. Both new values are built only from
$\sin(pos\,\theta)$ and $\cos(pos\,\theta)$. They are scaled by
$\cos(k\theta)$ and $\sin(k\theta)$.</p>

<p>This is exactly a matrix multiplication:</p>

\[\begin{bmatrix} \sin((pos+k)\theta) \\ \cos((pos+k)\theta) \end{bmatrix}
=
\begin{bmatrix} \cos(k\theta) &amp; \sin(k\theta) \\ -\sin(k\theta) &amp; \cos(k\theta) \end{bmatrix}
\begin{bmatrix} \sin(pos\,\theta) \\ \cos(pos\,\theta) \end{bmatrix}\]

<p>Call this matrix $M_k$:</p>

\[M_k =
\begin{bmatrix} \cos(k\theta) &amp; \sin(k\theta) \\ -\sin(k\theta) &amp; \cos(k\theta) \end{bmatrix}\]

<p>So we have:</p>

\[PE(pos + k) = M_k \cdot PE(pos)\]

<p>The matrix $M_k$ depends only on the shift $k$. It does not depend on
$pos$. The same shift always uses the same matrix.</p>

<p>This is the property we wanted.</p>
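<p>The identity $PE(pos + k) = M_k \cdot PE(pos)$ is easy to check numerically. The sketch below applies $M_k$ by hand to a (sine, cosine) pair. The values of <code>theta</code> and <code>k</code> are arbitrary illustrative choices, and the helper names are assumptions.</p>

```python
import math

def pe_pair(pos, theta):
    """The (sin, cos) pair for one frequency."""
    return (math.sin(pos * theta), math.cos(pos * theta))

def shift(pair, k, theta):
    """Apply M_k = [[cos(k*theta), sin(k*theta)], [-sin(k*theta), cos(k*theta)]]."""
    s, c = pair
    ck, sk = math.cos(k * theta), math.sin(k * theta)
    return (ck * s + sk * c, -sk * s + ck * c)

theta, k = 0.1, 5  # arbitrary illustrative values
for pos in (0, 3, 42):
    shifted = shift(pe_pair(pos, theta), k, theta)
    target = pe_pair(pos + k, theta)
    ok = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(shifted, target))
    print(pos, ok)
```

<p>Note that <code>shift</code> never looks at <code>pos</code>. It only uses the offset <code>k</code>, which is exactly the property derived above.</p>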

<h3 id="why-sine-alone-cannot-do-this">Why Sine Alone Cannot Do This</h3>

<p>Now try the same thing with only sine.</p>

<p>Suppose the encoding is just $\sin(pos\,\theta)$, a single value.</p>

<p>To get the encoding at $pos + k$, we need:</p>

\[\sin((pos + k)\theta) = \sin(pos\,\theta)\cos(k\theta) + \cos(pos\,\theta)\sin(k\theta)\]

<p>Look at the right hand side. It needs $\cos(pos\,\theta)$.</p>

<p>But a sine only encoding does not store $\cos(pos\,\theta)$. We only have
$\sin(pos\,\theta)$. The term we need is missing.</p>

<p>There is no way to recover $\cos(pos\,\theta)$ from $\sin(pos\,\theta)$
using a linear operation. So the shift cannot be written as a fixed matrix.</p>

<p>The transformation is impossible with sine alone.</p>
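<p>A quick worked check makes this concrete. With a single stored value, a "matrix" is just one scalar. Suppose some scalar $c_k$ did the job for every position, and plug in $pos = 0$:</p>

\[\sin((pos + k)\theta) = c_k \sin(pos\,\theta) \ \text{ for all } pos \quad\Longrightarrow\quad \sin(k\theta) = c_k \sin(0) = 0\]

<p>So a fixed scalar can only work when $k\theta$ happens to be a multiple of $\pi$. For a general shift $k$ it fails.</p>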

<h3 id="why-cosine-is-needed">Why Cosine Is Needed</h3>

<p>Cosine is the missing piece.</p>

<p>When we store both $\sin(pos\,\theta)$ and $\cos(pos\,\theta)$ together,
both terms in the expansion are available. The matrix $M_k$ has everything
it needs. The shift works.</p>

<p>This is why the encoding pairs sine and cosine at every frequency. It is
not redundant. Cosine supplies the second component that makes the linear
shift possible.</p>

<p>With both functions present, shifting a position becomes a rotation by the
matrix $M_k$.</p>

<p>This rotation idea will come back in a much bigger way when we reach RoPE.</p>

<p>Note that the model is never handed this relative information explicitly. It has to learn to exploit it implicitly, through its linear layers.</p>

<h3 id="how-sinusoidal-encoding-enters-the-model">How Sinusoidal Encoding Enters the Model</h3>

<p>So far we have studied the positional encoding on its own. We derived its
frequencies, its sine and cosine pairing, and its nice properties.</p>

<p>But there is a question we have not asked yet. How does this encoding
actually get used inside the Transformer?</p>

<p>The answer creates a hidden problem. The clean properties we discussed do not
fully survive once the encoding meets the rest of the model.</p>

<h3 id="two-properties-two-places">Two Properties, Two Places</h3>

<p>Before we go further, let’s separate two things we have established.</p>

<p>The first is the <strong>dot product property</strong>. For a single frequency pair, the dot product of two encodings is
$\sin(m\theta)\sin(n\theta) + \cos(m\theta)\cos(n\theta) = \cos(\theta(m - n))$, which depends only on relative position. This
property matters inside the attention mechanism, where queries and keys are
multiplied together.</p>

<p>The second is the <strong>linear shift property</strong>. A fixed matrix $M_k$ can shift
an encoding from one position to another. This property matters for the
linear layers in the model, such as $W_q$ and $W_k$.</p>

<p>These are two separate capabilities. They both need sine and cosine, but
for different mathematical reasons.</p>
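<p>The dot product property can be verified directly. Summed over all frequency pairs, the dot product of two raw encodings is $\sum_i \cos(\omega_i (m - n))$, so two pairs of positions with the same offset should give the same score. This is a minimal sketch with assumed helper names and a toy dimensionality of 16.</p>

```python
import math

def sinusoidal_pe(pos, d, base=10000.0):
    """Interleaved sin/cos encoding for one position."""
    pe = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe.extend((math.sin(angle), math.cos(angle)))
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d = 16  # toy dimensionality (an assumption)
print(dot(sinusoidal_pe(3, d), sinusoidal_pe(7, d)))    # offset m - n = -4
print(dot(sinusoidal_pe(10, d), sinusoidal_pe(14, d)))  # same offset, different positions
```

<p>Both lines print the same value up to floating point error, even though the absolute positions differ.</p>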

<h3 id="position-is-added-to-the-embedding">Position Is Added to the Embedding</h3>

<p>In the original Transformer, the positional encoding is added directly to
the token embedding before anything else happens.</p>

<p>For a token at position $m$:</p>

\[\text{input}_m = \text{embed}_m + PE_m\]

<p>The embedding carries the meaning of the token. The encoding carries its
position. We add them together into a single vector.</p>

<p>This combined vector is what flows into the attention mechanism. The model
then computes queries and keys from it:</p>

\[Q_m = W_q \cdot (\text{embed}_m + PE_m) = W_q \cdot \text{embed}_m + W_q \cdot PE_m\]

\[K_n = W_k \cdot (\text{embed}_n + PE_n) = W_k \cdot \text{embed}_n + W_k \cdot PE_n\]

<p>Each query and key now has two parts. A semantic part from the embedding,
and a positional part from the encoding.</p>

<h3 id="the-four-term-expansion">The Four Term Expansion</h3>

<p>Now we compute the attention score. The score is the dot product of $Q_m$
and $K_n$.</p>

<p>Both $Q_m$ and $K_n$ have two parts. When we multiply two sums, every part
of the first multiplies every part of the second. Two parts times two parts
gives four terms.</p>

\[Q_m \cdot K_n = (W_q \cdot \text{embed}_m + W_q \cdot PE_m) \cdot (W_k \cdot \text{embed}_n + W_k \cdot PE_n)\]

<p>Expanding gives four terms:</p>

\[\begin{aligned}
Q_m \cdot K_n =\ &amp;(W_q \cdot \text{embed}_m) \cdot (W_k \cdot \text{embed}_n) \quad &amp;\text{Term 1}\\
+\ &amp;(W_q \cdot \text{embed}_m) \cdot (W_k \cdot PE_n) \quad &amp;\text{Term 2}\\
+\ &amp;(W_q \cdot PE_m) \cdot (W_k \cdot \text{embed}_n) \quad &amp;\text{Term 3}\\
+\ &amp;(W_q \cdot PE_m) \cdot (W_k \cdot PE_n) \quad &amp;\text{Term 4}
\end{aligned}\]

<p>Let us read each term.</p>

<ul>
  <li>
    <p><strong>Term 1</strong> is purely semantic. It is the meaning of token $m$ against the
meaning of token $n$. No position involved.</p>
  </li>
  <li>
    <p><strong>Term 2</strong> is a cross term. The meaning of token $m$ against the position
of token $n$. Semantic mixed with position.</p>
  </li>
  <li>
    <p><strong>Term 3</strong> is the other cross term. The position of token $m$ against the
meaning of token $n$. Position mixed with semantic.</p>
  </li>
  <li>
    <p><strong>Term 4</strong> is purely positional. The position of token $m$ against the
position of token $n$. This is the only term that can carry
$\cos(\theta(m - n))$, the clean relative position signal, and even here it passes through $W_q$ and $W_k$.</p>
  </li>
</ul>

<video autoplay="" loop="" muted="" playsinline="" style="max-width:100%; height:auto; display:block; margin:1.5rem auto;">
  <source src="/assets/animations/Positional_Encoding/FourTermExpansio.mp4" type="video/mp4" />
</video>

<p>The relative position information we worked so hard to derive lives only in
Term 4.</p>
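<p>The expansion is just distributivity, and a few lines of code confirm that the four terms add up to the full score. Everything here is a stand-in: the weights, embeddings, and even the positional vectors are random numbers chosen only to exercise the algebra, not values from a real model.</p>

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
d = 4           # toy dimensionality, chosen only for illustration

def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# Random stand-ins (assumptions): real weights and embeddings are learned.
W_q = [rand_vec(d) for _ in range(d)]
W_k = [rand_vec(d) for _ in range(d)]
embed_m, embed_n = rand_vec(d), rand_vec(d)
pe_m, pe_n = rand_vec(d), rand_vec(d)

term1 = dot(matvec(W_q, embed_m), matvec(W_k, embed_n))  # meaning x meaning
term2 = dot(matvec(W_q, embed_m), matvec(W_k, pe_n))     # meaning x position
term3 = dot(matvec(W_q, pe_m), matvec(W_k, embed_n))     # position x meaning
term4 = dot(matvec(W_q, pe_m), matvec(W_k, pe_n))        # position x position

full = dot(matvec(W_q, add(embed_m, pe_m)), matvec(W_k, add(embed_n, pe_n)))
print(full, term1 + term2 + term3 + term4)
```

<p>The attention mechanism only ever receives <code>full</code>. The four individual terms exist on paper, not as separate quantities inside the model.</p>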

<h3 id="the-entanglement-problem">The Entanglement Problem</h3>

<p>Here is the issue.</p>

<p>The model never sees Term 4 by itself. It sees the sum of all four terms.
The clean relative position signal is buried inside a mixture.</p>

<p>Terms 2 and 3 are the troublemakers. They mix semantic content with
positional content. They are noise sitting on top of the signal the model
actually wants.</p>

<p>The model has to work through this mixture. There is no part of the
architecture that isolates Term 4. The model must learn, on its own, how to
make use of the relative position signal while ignoring the cross terms.</p>

<p>This creates a heavy burden on $W_q$ and $W_k$. These two matrices must do
two jobs at once. They must project semantic meaning into a useful space.
And they must preserve the positional structure so the relative position
signal survives. Two competing goals, packed into one set of weights.</p>

<h3 id="absolute-versus-relative-position">Absolute Versus Relative Position</h3>

<p>There is a deeper issue here.</p>

<p>The encoding $PE_m$ stores absolute position. $PE_5$ is a fixed vector. It
means position 5 and nothing else, no matter what sequence it appears in.</p>

<p>But what the model actually wants is relative position. It wants to know
that two tokens are 9 apart, not that one is at position 5 and the other at
position 14.</p>

<p>Relative position is not stored anywhere. It only appears as a byproduct,
inside Term 4, after the dot product is computed. It is never represented
directly.</p>

<p>So the model is given absolute positions and asked to figure out relative
positions on its own. It can do this, but only by learning. There is no
guarantee it learns it perfectly.</p>

<h3 id="position-is-fused-with-meaning">Position Is Fused With Meaning</h3>

<p>There is one more limitation, and it is structural.</p>

<p>Once we compute $\text{embed}_m + PE_m$, the two parts are added into a
single vector. They cannot be pulled apart again. Addition destroys the
boundary between them.</p>

<p>Every layer after this point sees one fused vector. It cannot choose to
look at only the meaning, or only the position. The two are tangled
together for the rest of the network.</p>

<p>Sometimes the model only needs meaning. Sometimes it only needs position.
But it cannot separate them. It is stuck with the mixture.</p>

<h3 id="summing-up-the-limitations">Summing Up the Limitations</h3>

<p>Sinusoidal encoding is smooth, bounded, unique, and carries relative
position inside its dot product. It was a real step forward.</p>

<p>But it has three weaknesses:</p>

<ul>
  <li>
    <p><strong>Entanglement.</strong> Relative position is buried inside a four term mixture.
Two cross terms add noise the model must learn to ignore.</p>
  </li>
  <li>
    <p><strong>No direct relative position.</strong> The encoding stores absolute position.
Relative position only appears as a byproduct of the dot product, never
as an explicit representation.</p>
  </li>
  <li>
    <p><strong>Fused representation.</strong> Position is added into the embedding and can
never be separated. Every later layer is forced to handle the mixture.</p>
  </li>
</ul>

<p>All three problems share one root cause. Position is <strong>added</strong> to the
embedding.</p>

<p>What if we did not add position at all? What if, instead of adding a
positional vector, we applied position as an operation directly on the
query and key?</p>

<p>This is the idea behind RoPE.</p>

<h2 id="idea-5-rotary-position-embeddings-rope">Idea 5: Rotary Position Embeddings (RoPE)</h2>

<p>Sinusoidal encoding had three problems. All of them came from one decision:
position was added to the embedding.</p>

<p>What if we never add position at all?</p>

<p>This is the idea behind RoPE, introduced by Su et al. in 2021. It is the
positional encoding used in almost every modern large language model,
including LLaMA, Mistral, Gemma, and Phi.</p>

<h3 id="the-design-goal">The Design Goal</h3>

<p>In sinusoidal encoding, the attention score expanded into four terms. Only
one of them carried clean relative position. Two of the others were cross
terms, noise that mixed meaning with position.</p>

<p>We want something better. We want the attention score to look like this:</p>

\[Q_m \cdot K_n = f(\text{semantics},\ m - n)\]

<p>One clean expression. The score should depend on the meaning of the two
tokens and on their relative distance $m - n$. Nothing else. No cross
terms. No entanglement.</p>

<h3 id="the-key-insight">The Key Insight</h3>

<p>The problem with sinusoidal encoding was the order of operations.</p>

<p>We added position to the embedding first. Then we multiplied by $W_q$ and
$W_k$. Because position was already mixed into the embedding, the
multiplication produced cross terms.</p>

<p>RoPE flips the order. It does not touch the embedding. Instead, it lets
$W_q$ and $W_k$ do their work first, producing the query and key. Then it
applies position directly to those vectors.</p>

<p>In other words: apply position <strong>after</strong> the projection, <strong>not before</strong>.</p>

<h3 id="why-this-helps">Why This Helps</h3>

<p>When position is applied after $W_q$ and $W_k$, those two matrices no
longer have to deal with position at all. Their only job is to handle
meaning. They project the token embedding into a query or key that captures
semantic content. That is it.</p>

<p>Position becomes a separate, independent step. It is applied on top of the
query and key as its own operation.</p>

<p>There are no longer two competing goals packed into $W_q$ and $W_k$. The
projection handles meaning. The positional operation handles position. Each
does one job.</p>

<p>Now the question becomes: what operation should we apply to the query and
key to inject position?</p>

<p>The answer is rotation.</p>

<h3 id="the-rotation-operation">The Rotation Operation</h3>

<p>Take a query vector. To keep things simple, let’s start with just two dimensions.</p>

\[Q = (q_0,\ q_1)\]

<p>This query sits at position $m$. We pick a frequency $\theta$, the same
kind of frequency we used in sinusoidal encoding.</p>

<p>To inject position, we rotate this 2D vector by an angle of $m\theta$. The
rotation is done with a rotation matrix:</p>

\[\begin{bmatrix} q_0^{new} \\ q_1^{new} \end{bmatrix}
=
\begin{bmatrix} \cos(m\theta) &amp; -\sin(m\theta) \\ \sin(m\theta) &amp; \cos(m\theta) \end{bmatrix}
\begin{bmatrix} q_0 \\ q_1 \end{bmatrix}\]

<p>This uses the same sine and cosine values as sinusoidal encoding. But the
operation is different. We are not adding anything. We are rotating the
vector.</p>

<p>The position $m$ decides how much we rotate. A token at position 1 is
rotated by $\theta$. A token at position 2 is rotated by $2\theta$. A token
at position 100 is rotated by $100\theta$. The further along the sequence,
the more the vector turns.</p>

<h3 id="writing-out-the-rotation">Writing Out the Rotation</h3>

<p>Let us expand the matrix multiplication to see the new values directly.</p>

\[q_0^{new} = q_0 \cos(m\theta) - q_1 \sin(m\theta)\]

\[q_1^{new} = q_0 \sin(m\theta) + q_1 \cos(m\theta)\]

<p>The new query is a mix of the old components, weighted by sine and cosine
of the rotation angle.</p>

<p>The length of the vector does not change. Rotation only turns the vector,
it does not stretch or shrink it. The meaning carried by the magnitude
stays intact. Only the direction shifts, and the amount of shift encodes
the position.</p>
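<p>Here is a minimal sketch of the 2D rotation, written straight from the two expanded formulas. The function name <code>rotate</code> and the sample values are assumptions for illustration. It also prints the vector length to show that rotation leaves it unchanged.</p>

```python
import math

def rotate(vec, pos, theta):
    """Rotate a 2D vector by the angle pos * theta."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

q = (3.0, 4.0)  # sample query with length 5
theta = 0.5     # arbitrary illustrative frequency
for m in (0, 1, 2, 100):
    q_new = rotate(q, m, theta)
    print(m, q_new, math.hypot(*q_new))
```

<p>At position 0 the vector is returned unchanged, and at every other position only its direction moves.</p>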

<p>We do the exact same thing to the key vector, using its position $n$:</p>

\[k_0^{new} = k_0 \cos(n\theta) - k_1 \sin(n\theta)\]

\[k_1^{new} = k_0 \sin(n\theta) + k_1 \cos(n\theta)\]

<p>Now both the query and the key have been rotated by their own positions.</p>

<p>The next question is what happens when we take the dot product of
two rotated vectors.</p>

<h3 id="why-rotation-gives-pure-relative-position">Why Rotation Gives Pure Relative Position</h3>

<p>We have rotated the query by its position and the key by its position. Now
we take their dot product and see what comes out.</p>

<p>This is the heart of RoPE. The result is clean in a way sinusoidal encoding
never was.</p>

<h3 id="setting-up-the-angles">Setting Up the Angles</h3>

<p>Every 2D vector has a direction, which we can describe with an angle.</p>

<p>Let the query $Q$ point in direction $\alpha$. This angle captures the
semantic content of the query, the meaning that $W_q$ produced.</p>

<p>Let the key $K$ point in direction $\beta$. This angle captures the
semantic content of the key.</p>

<p>Assume for simplicity that both are unit vectors. Before any rotation, their
dot product depends only on the angle between them:</p>

\[Q \cdot K = \cos(\alpha - \beta)\]

<p>The score depends on $\alpha - \beta$, the angle between the two
directions. This is the semantic relationship between the query and the
key.</p>

<p><img src="/assets/animations/Positional_Encoding/relation_between_vectors.png" alt="Relation Between Vectors" /></p>

<h3 id="applying-the-rotation">Applying the Rotation</h3>

<p>Now we rotate. The query is at position $m$, so we turn it by $m\theta$.
The key is at position $n$, so we turn it by $n\theta$.</p>

<p>Rotation simply adds to the angle. After rotation:</p>

\[Q_m \text{ points in direction } \alpha + m\theta\]

\[K_n \text{ points in direction } \beta + n\theta\]

<p>Take the dot product of the rotated vectors. It depends on the angle
between them, just like before:</p>

\[Q_m \cdot K_n = \cos\big((\alpha + m\theta) - (\beta + n\theta)\big)\]

<p>Simplify the inside:</p>

\[Q_m \cdot K_n = \cos\big((\alpha - \beta) + (m - n)\theta\big)\]

<p>Look at this result carefully.</p>

<video autoplay="" loop="" muted="" playsinline="" style="max-width:100%; height:auto; display:block; margin:1.5rem auto;">
  <source src="/assets/animations/Positional_Encoding/RotationRelativeAngle.mp4" type="video/mp4" />
</video>

<h3 id="reading-the-result">Reading the Result</h3>

<p>There is one term. Just one. No four term expansion. No cross terms.</p>

<p>Inside the cosine there are two pieces:</p>

<ul>
  <li>
    <p>$(\alpha - \beta)$ is the semantic relationship. It is exactly the same
angle that was there before rotation. It is untouched.</p>
  </li>
  <li>
    <p>$(m - n)\theta$ is the positional piece. It depends only on $m - n$, the
relative distance between the two tokens.</p>
  </li>
</ul>

<p>The semantic part and the positional part sit side by side inside a single
expression. The meaning is preserved. The position is relative. There
is nothing to disentangle.</p>

<p>This is exactly the goal we set at the start. The attention score depends
only on the semantics and on the relative distance $m - n$.</p>
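<p>A quick numerical check of this result, under the same unit vector assumption used in the derivation. The angles, frequency, and positions below are arbitrary illustrative values.</p>

```python
import numpy as np

def unit_vector(angle):
    """2D unit vector pointing in the given direction (radians)."""
    return np.array([np.cos(angle), np.sin(angle)])

alpha, beta = 0.9, 0.4   # semantic directions of Q and K (arbitrary)
theta = 0.3              # illustrative frequency
m, n = 12, 5             # positions of the query and the key

# Rotation adds position * theta to each direction
q_m = unit_vector(alpha + m * theta)
k_n = unit_vector(beta + n * theta)

score = q_m @ k_n
predicted = np.cos((alpha - beta) + (m - n) * theta)

# Shift both tokens by the same offset: the gap m - n is unchanged
shift = 100
score_shifted = (unit_vector(alpha + (m + shift) * theta)
                 @ unit_vector(beta + (n + shift) * theta))

print(score, predicted, score_shifted)
```

<p>All three numbers should agree up to floating point error, because only $\alpha - \beta$ and $m - n$ enter the score.</p>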

<h3 id="the-same-result-through-matrices">The Same Result Through Matrices</h3>

<p>The angle argument is good, but let us confirm it with the matrices
directly.</p>

<p>Write the rotated query as $R_m Q$ and the rotated key as $R_n K$, where
$R_m$ and $R_n$ are rotation matrices. Their dot product is:</p>

\[Q_m \cdot K_n = (R_m Q)^\top (R_n K) = Q^\top R_m^\top R_n K\]

<p>Rotation matrices have a special property. They are orthogonal, which means
the transpose equals the inverse:</p>

\[R_m^\top = R_m^{-1} = R_{-m}\]

<p>So the expression becomes:</p>

\[Q^\top R_{-m} R_n K\]

<p>Rotations combine by adding their angles. Rotating by $-m$ and then by $n$
is the same as rotating by $n - m$:</p>

\[R_{-m} R_n = R_{n - m}\]

<p>Putting it together:</p>

\[Q_m \cdot K_n = Q^\top R_{n - m} K\]

<p>The result depends only on $R_{n - m}$. The absolute positions $m$ and $n$
never appear on their own. Only their difference $n - m$ survives.</p>

<p>This is the same conclusion as the angle argument, now proven through the
matrix algebra.</p>
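<p>The matrix steps can also be verified numerically. The sketch below assumes a single 2D pair with an arbitrary frequency. The helper <code>rotation_matrix</code> is a name introduced here for illustration.</p>

```python
import numpy as np

def rotation_matrix(position, theta):
    """2x2 rotation matrix for the angle position * theta."""
    a = position * theta
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

theta = 0.3
m, n = 12, 5
R_m = rotation_matrix(m, theta)
R_n = rotation_matrix(n, theta)

# Orthogonality: the transpose undoes the rotation
is_orthogonal = np.allclose(R_m.T @ R_m, np.eye(2))

# Composition: R_m^T R_n equals a single rotation by n - m
composes = np.allclose(R_m.T @ R_n, rotation_matrix(n - m, theta))

# The score of the rotated vectors matches Q^T R_{n-m} K
Q = np.array([0.2, -1.3])
K = np.array([0.7, 0.5])
lhs = (R_m @ Q) @ (R_n @ K)
rhs = Q @ rotation_matrix(n - m, theta) @ K

print(is_orthogonal, composes, lhs, rhs)
```

<p>Note that $Q$ and $K$ here are not unit vectors. The matrix argument does not need that assumption.</p>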

<h3 id="does-rotation-corrupt-semantic-meaning">Does Rotation Corrupt Semantic Meaning?</h3>

<p>There is a natural objection to all of this. Let us look at it carefully,
because it is the doubt I faced at first.</p>

<h3 id="the-concern">The Concern</h3>

<p>The direction of $Q$ encodes the meaning of the token. That is what we said.
The angle $\alpha$ carries semantic content.</p>

<p>Rotation changes the direction. After rotation, $Q$ points in a new
direction $\alpha + m\theta$.</p>

<p>So if direction is meaning, and rotation changes direction, then rotation
must change meaning. Rotation should corrupt the semantic content of the
token.</p>

<p>This reasoning feels right. But it has a flaw.</p>

<h3 id="why-the-reasoning-fails">Why the Reasoning Fails</h3>

<p>The mistake is in the first step. Meaning in attention is not the direction
of a single vector.</p>

<p>What the attention mechanism actually computes is the dot product between a
query and a key. And the dot product depends on the angle <strong>between</strong> them,
not on either direction alone:</p>

\[Q \cdot K = \cos(\alpha - \beta)\]

<p>The semantic relationship is $\alpha - \beta$. It is a relationship between
two vectors, not a property of one.</p>

<p>Now look again at what rotation does:</p>

\[Q_m \cdot K_n = \cos\big((\alpha - \beta) + (m - n)\theta\big)\]

<p>The semantic relationship $(\alpha - \beta)$ is still inside the cosine. The rotation only added a positional piece next to it.</p>

<p>The meaning is preserved. It is combined with position, not removed or destroyed.</p>

<video autoplay="" loop="" muted="" playsinline="" style="max-width:100%; height:auto; display:block; margin:1.5rem auto;">
  <source src="/assets/animations/Positional_Encoding/RotationPreservesAngle.mp4" type="video/mp4" />
</video>

<h3 id="the-ideal-case">The Ideal Case</h3>

<p>Consider two tokens at the same position, so $m = n$.</p>

<p>The positional piece becomes $(m - n)\theta = 0$. The dot product is:</p>

\[Q_m \cdot K_n = \cos(\alpha - \beta + 0) = \cos(\alpha - \beta)\]

<p>This is exactly the original dot product, with no rotation effect at all.</p>

<p>Two tokens at the same position have their full semantic relationship,
completely intact. Rotation changed nothing about the meaning.</p>

<h3 id="rotation-preserves-length">Rotation Preserves Length</h3>

<p>There is another way to see that rotation does not damage the vectors.</p>

<p>Rotation preserves length. A rotated vector has the same magnitude as the
original. We can prove this directly.</p>

<p>The squared length of a rotated vector is:</p>

\[|R v|^2 = (R v)^\top (R v) = v^\top R^\top R v\]

<p>Since rotation is orthogonal, $R^\top R = I$:</p>

\[v^\top R^\top R v = v^\top v = |v|^2\]

<p>The length is unchanged. The magnitude of a query or key often carries
learned information too, and rotation leaves it completely unchanged. Only the
direction turns.</p>

<video autoplay="" loop="" muted="" playsinline="" style="max-width:100%; height:auto; display:block; margin:1.5rem auto;">
  <source src="/assets/animations/Positional_Encoding/RotationPreservesLength.mp4" type="video/mp4" />
</video>
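<p>The length argument is easy to confirm on a random vector. This is only a sanity check with arbitrary angles, not part of any RoPE implementation.</p>

```python
import numpy as np

def rotation_matrix(angle):
    """2x2 rotation matrix for an angle in radians."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

rng = np.random.default_rng(0)
v = rng.normal(size=2) * 5.0          # a vector with a non-unit length

original_length = np.linalg.norm(v)
rotated_lengths = [np.linalg.norm(rotation_matrix(a) @ v)
                   for a in (0.1, 1.0, 2.5, 6.0)]

print(original_length, rotated_lengths)
```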

<h3 id="where-do-the-angles-α-and-β-come-from">Where Do the Angles α and β Come From?</h3>

<p>We have been talking about the query angle $\alpha$ and the key angle
$\beta$. There is a common confusion about what these angles actually are.
Let us clear it up.</p>

<h3 id="not-the-embedding-angle">Not the Embedding Angle</h3>

<p>The angle $\alpha$ is not the angle of the token embedding.</p>

<p>Remember the order of operations. First the embedding goes through $W_q$.
This produces the query. The angle $\alpha$ is the direction of that query,
after the projection.</p>

\[Q = W_q \cdot \text{embed}, \quad \alpha = \text{direction of } Q\]

<p>Same for the key. The angle $\beta$ is the direction of $K = W_k \cdot
\text{embed}$, after projection by $W_k$.</p>

<p>These angles live in the query and key space. They are not angles in the
original embedding space.</p>

<h3 id="why-they-are-different-spaces">Why They Are Different Spaces</h3>

<p>The projection matrix $W_q$ is rectangular. It might take a 768 dimensional
embedding and produce a 64 dimensional query, one per attention head.</p>

<p>A rectangular matrix does not just rotate the vector. It reshapes it,
reweights it, and drops it into a smaller space. The output direction has
no simple relationship to the input direction.</p>

<p>So the angle of the embedding and the angle $\alpha$ of the query are not
connected in any direct geometric way. The query angle is something new,
created by the projection.</p>

<h3 id="then-why-does-α--β-carry-meaning">Then Why Does (α − β) Carry Meaning?</h3>

<p>If $\alpha$ comes out of a projection, why does the angle $\alpha - \beta$
carry semantic meaning at all?</p>

<p>The answer is training.</p>

<p>$W_q$ and $W_k$ are not fixed. They are learned through gradient descent,
together with the rest of the model. During training, the loss pushes these
matrices to arrange the queries and keys in a useful way.</p>

<p>The arrangement that minimizes the loss is the one where:</p>

<ul>
  <li>
    <p>Tokens that <strong>should</strong> attend to each other get a small angle
$(\alpha - \beta)$. A small angle gives a high cosine, which gives a high
attention score.</p>
  </li>
  <li>
    <p>Tokens that <strong>should not</strong> attend get a large angle $(\alpha - \beta)$. A
large angle gives a low cosine, which gives a low attention score.</p>
  </li>
</ul>

<p>So the semantic meaning inside $(\alpha - \beta)$ is not inherited from the
embedding space. It is learned. Training shapes $W_q$ and $W_k$ until the
angle between a query and a key reflects how strongly the two tokens should
attend.</p>

<p>The query and key space is built by training to make $(\alpha - \beta)$
mean something. RoPE then rotates within that learned space.</p>

<h3 id="scaling-to-full-dimensions">Scaling to Full Dimensions</h3>

<p>So far we worked with a 2D query. Real queries have many dimensions, often
64 or 128 per head. How does rotation work there?</p>

<p>The idea is simple. We split the vector into pairs and rotate each pair on
its own.</p>

<h3 id="pairing-the-dimensions">Pairing the Dimensions</h3>

<p>Take a query of dimension $d$. Split it into $d/2$ consecutive pairs:</p>

\[(q_0, q_1),\ (q_2, q_3),\ \dots,\ (q_{d-2}, q_{d-1})\]

<p>Each pair is a little 2D vector. We rotate each one exactly the way we did
before.</p>

<p>The key detail is that each pair gets its own frequency. Pair $i$ uses
frequency:</p>

\[\theta_i = \frac{1}{10000^{2i/d}}\]

<p>This is the same frequency formula from sinusoidal encoding. Early pairs
get high frequencies and rotate fast. Later pairs get low frequencies and
rotate slowly.</p>

<p>Each pair is rotated independently. The first pair does not interact with
the second pair. There is no mixing across pairs.</p>
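<p>The frequency formula translates directly into code. The function name <code>rope_frequencies</code> and the head size of 64 are assumptions for this sketch.</p>

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = 1 / base**(2i/d) for pair index i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    return 1.0 / base ** (2 * i / d)

freqs = rope_frequencies(64)

print(freqs.shape)    # one frequency per pair
print(freqs[0])       # first pair: the fastest rotation
print(freqs[-1])      # last pair: the slowest rotation
```

<p>For $d = 64$ the slowest frequency is close to, but not exactly, $0.0001$. It approaches that value as $d$ grows.</p>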

<h3 id="the-block-diagonal-matrix">The Block Diagonal Matrix</h3>

<p>We can write the full rotation as one big $d \times d$ matrix.</p>

<p>It is block diagonal. Along the diagonal sit $d/2$ small $2 \times 2$
rotation blocks, one per pair. Everywhere outside those blocks, the entries are
zero.</p>

\[R(m) =
\begin{bmatrix}
R(m\theta_0) &amp; 0 &amp; \cdots &amp; 0 \\
0 &amp; R(m\theta_1) &amp; \cdots &amp; 0 \\
\vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
0 &amp; 0 &amp; \cdots &amp; R(m\theta_{d/2-1})
\end{bmatrix}\]

<p>Each block $R(m\theta_i)$ is the familiar $2 \times 2$ rotation matrix for
pair $i$ at position $m$.</p>
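<p>Here is a minimal sketch of both views, assuming the consecutive pairing described above. <code>rope_matrix</code> builds the explicit block diagonal matrix, and <code>apply_rope</code> rotates the pairs directly without ever forming it. Both names are hypothetical. Library implementations may lay out the pairs differently.</p>

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    return 1.0 / base ** (2 * np.arange(d // 2) / d)

def rope_matrix(position, d):
    """Explicit d x d block diagonal rotation matrix R(position)."""
    R = np.zeros((d, d))
    for i, theta in enumerate(rope_frequencies(d)):
        a = position * theta
        R[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(a), -np.sin(a)],
                                   [np.sin(a),  np.cos(a)]]
    return R

def apply_rope(x, position):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... in place of a matmul."""
    angles = position * rope_frequencies(x.shape[-1])
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)

# The two views agree
same_as_matrix = np.allclose(rope_matrix(5, d) @ q, apply_rope(q, 5))

# Same gap of 7 at two different absolute offsets gives the same score
score_a = apply_rope(q, 12) @ apply_rope(k, 5)
score_b = apply_rope(q, 112) @ apply_rope(k, 105)

print(same_as_matrix, score_a, score_b)
```

<p>Because the matrix is mostly zeros, the pairwise version does the same work with far fewer multiplications, which is why the full matrix is rarely built in practice.</p>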

<h3 id="does-rope-hurt-attention-between-distant-tokens">Does RoPE Hurt Attention Between Distant Tokens?</h3>

<p>The attention score is
$\cos((\alpha - \beta) + (m - n)\theta)$. The semantic part and the
positional part sit together inside one cosine.</p>

<p>But look closely at those two parts. There is a tension between them.</p>

<h3 id="the-tension">The Tension</h3>

<p>Imagine a paragraph that starts with “Naveen finished the report” and ends,
many sentences later, with “He sent it to the team.” The word “He” refers
back to “Naveen.” For the model to understand the sentence, the query for
“He” must attend to the key for “Naveen.” Suppose these two words are 34
tokens apart.</p>

<p>The model wants them to attend. So the semantic angle $(\alpha - \beta)$ is
small, which pushes the cosine up toward a high score.</p>

<p>But the positional part $(m - n)\theta$ is not zero. The tokens are far
apart, so this term adds a real angle on top of the semantic one.</p>

<p>The two parts pull in different directions. The semantic angle wants the
score high. The positional angle pushes the total angle up, which pulls the
cosine down.</p>

<p>When the gap is large, the positional part can fight against the semantic
signal. Does the position term quietly suppress
attention between tokens that should be linked?</p>

<h3 id="why-it-is-fine-at-normal-distances">Why It Is Fine at Normal Distances</h3>

<p>The answer is the frequency spectrum, and we already have the pieces.</p>

<p>The slow pairs have a tiny frequency. At a gap of 34, the extra angle they
add is almost nothing:</p>

\[0.0001 \cdot 34 \approx 0.0034 \text{ radians} \approx 0.2^\circ\]

<p>That is a fifth of a degree. The positional part barely moves the cosine in
the slow pairs.</p>

<p>And we saw that training pushes the long range signal exactly into those
slow pairs. So the “He to Naveen” link lives where the positional cost is
nearly zero. The semantic signal wins easily.</p>

<p>Multi head attention helps too. Different heads have their own $W_q$ and
$W_k$. Some heads can specialize in long range links and lean mostly on the
slow pairs. The model has a dedicated mechanism for exactly this case.</p>

<p>So at sentence and paragraph distances, the tension is real but harmless.
The design routes long range signals to where position does not interfere.</p>

<h3 id="where-it-genuinely-breaks">Where It Genuinely Breaks</h3>

<p>Now push the gap much further. Not 34 tokens. Try 32000 tokens.</p>

<p>Even the slow pairs have a small but nonzero frequency. Multiply it by a
huge gap and the angle is no longer small:</p>

\[0.0001 \cdot 32000 \approx 3.2 \text{ radians} \approx 183^\circ\]

<p>Now the slow pair has rotated past 180 degrees. Cosine turns negative once
the angle passes 90 degrees, and near 180 degrees it is close to $-1$.</p>

<p>A negative cosine means the contribution is now opposite. The slow pair, which
was supposed to carry the long range signal, is now pushing the score down.
The model is being told to push these two tokens apart, even though they may
be strongly related.</p>

<p>This is not the model being confused about distance. It is worse. The
positional term has actively flipped and is working against the right
answer.</p>
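<p>The arithmetic for a short gap and a very long gap can be compared side by side. The sketch uses the approximate slow frequency $0.0001$ from the text and ignores the semantic angle, so it shows only the positional contribution.</p>

```python
import numpy as np

slow_theta = 1e-4   # approximate slowest frequency used in the text

results = {}
for gap in (34, 32000):
    angle = slow_theta * gap                  # radians added by position
    results[gap] = (np.degrees(angle), np.cos(angle))

for gap, (degrees, cosine) in results.items():
    print(gap, round(degrees, 1), round(cosine, 4))
```

<p>At the short gap the cosine stays essentially at 1. At the long gap it sits near $-1$, which is the sign flip described above.</p>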

<p>This is a real, known limitation of RoPE. It is why so much research goes
into extending RoPE to longer contexts.</p>

<h3 id="what-happens-when-rotation-passes-360-degrees">What Happens When Rotation Passes 360 Degrees</h3>

<p>The 180 degree problem points to a deeper issue. Rotation is periodic. Turn
far enough and you come back to where you started. Let us look at what that
means for position.</p>

<h3 id="the-aliasing-condition">The Aliasing Condition</h3>

<p>Cosine repeats every 360 degrees, or $2\pi$ radians. So two different gaps
can produce the exact same cosine.</p>

<p>For a pair with frequency $\theta_i$, two gaps $g_1$ and $g_2$ give the same
value when their difference completes a full number of turns:</p>

\[(g_1 - g_2) \cdot \theta_i = 2\pi \cdot k\]

<p>for some whole number $k$. When this happens, the two gaps are
indistinguishable in that pair. This is called aliasing.</p>

<h3 id="aliasing-in-each-band">Aliasing in Each Band</h3>

<p>Fast pairs alias quickly. With frequency near 1, the pattern repeats about
every 6 tokens, since $2\pi \approx 6.28$. Gap 1 and gap 7 look similar to a fast pair.</p>

<p>This may not seem good, but it is fine. Fast pairs are only meant for short range
precision. They do their job locally and we never rely on them for long
distances.</p>

<p>Slow pairs alias very slowly. With frequency near 0.0001, they do not repeat
until about 62832 tokens. Within any normal context, they never alias.</p>

<p>So each band repeats at its own distance. Fast pairs repeat every few
tokens. Slow pairs repeat only after tens of thousands.</p>
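<p>The repeat distance of a pair is just $2\pi / \theta_i$. A small sketch, with the helper name <code>alias_period</code> introduced only for this example:</p>

```python
import numpy as np

def alias_period(theta):
    """Gap difference (in tokens) after which a pair with this frequency repeats."""
    return 2 * np.pi / theta

fast_period = alias_period(1.0)     # fastest pair
slow_period = alias_period(1e-4)    # approximate slowest pair

print(fast_period, slow_period)
```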

<h3 id="why-the-spectrum-saves-us">Why the Spectrum Saves Us</h3>

<p>Here is the beautiful part. A single pair aliases often. But the full set of
pairs almost never aliases all at once.</p>

<p>For two gaps to be truly indistinguishable, they would have to alias in
every pair at the same time. That means their difference would have to be a
whole number of turns for the fast period, the medium period, and the slow
period, all together.</p>

<p>So even though every pair aliases on its own, the combination of all pairs
gives each gap a unique fingerprint.</p>
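<p>One way to see the fingerprint idea is to compare how close two different gaps can get in a single fast pair versus in all pairs together. The sketch below is an illustration under assumed settings: $d = 64$, gaps from 0 to 255, and a "fingerprint" defined here as the cosine and sine of the rotation angle in every pair.</p>

```python
import numpy as np

d = 64
freqs = 1.0 / 10000.0 ** (2 * np.arange(d // 2) / d)

def fingerprint(gap):
    """Cosine and sine of the rotation angle in every pair for a given gap."""
    angles = gap * freqs
    return np.concatenate([np.cos(angles), np.sin(angles)])

def min_pairwise_distance(points):
    """Smallest Euclidean distance between two different rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)     # ignore each gap against itself
    return dists.min()

prints = np.stack([fingerprint(g) for g in range(256)])

# Columns 0 and d/2 hold the cosine and sine of the fastest pair only
fast_only = min_pairwise_distance(prints[:, [0, d // 2]])
all_pairs = min_pairwise_distance(prints)

print(fast_only, all_pairs)
```

<p>In the fastest pair alone, some two gaps land almost on top of each other. Across all pairs, the closest two gaps stay clearly separated.</p>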

<h3 id="the-real-danger-is-not-aliasing">The Real Danger Is Not Aliasing</h3>

<p>Aliasing means the model confuses two distances. That is not good, but there is
something worse, and we already saw it.</p>

<p>When a slow pair rotates past 90 degrees, its cosine turns negative, and near
180 degrees it is close to $-1$. The
model is not just confused now. It is actively pushed the wrong way. A pair
that should support a long range link instead fights it.</p>

<p>This is the precise failure that appears at very long contexts. It is the structural reason that standard
RoPE struggles past its trained context length, and the reason researchers
built methods to extend it.</p>

<h2 id="wrapping-up">Wrapping Up</h2>

<p>Let us review the whole journey.</p>

<p>We started with the simplest idea, adding the raw position number, and
watched it fail on scale. We normalized it and it failed on consistency. We
moved to binary vectors and found a beautiful multi frequency structure, but
the jumps between positions broke smoothness.</p>

<p>Sinusoidal encoding fixed the smoothness. It gave every position a unique,
bounded, smooth vector, and its dot product quietly carried relative
position. But adding it to the embedding entangled position with meaning and
forced the model to untangle four terms.</p>

<p>RoPE fixed the entanglement. By rotating the query and key instead of adding
to them, it made the attention score depend cleanly on the semantic angle
and the relative distance, with no cross terms. It preserved meaning,
preserved length, and spread position across a spectrum of frequencies that
the model learns to use, fast pairs for nearby tokens and slow pairs for
distant ones.</p>

<p>But at very long distances the slow pairs eventually rotate
too far, the cosine flips, and attention is pushed the wrong way. That limit
is exactly what modern long context research works to mitigate.</p>

<p>But for the sequence lengths that today’s models are trained on, RoPE is
clean, efficient, and effective. That is why it is used inside almost every
modern large language model, from LLaMA to Mistral to Gemma to Phi.</p>

<h2 id="references">References</h2>

<ol>
  <li>
    <p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, L., Polosukhin, I. (2017). <em>Attention Is All You Need.</em>
<a href="https://arxiv.org/pdf/1706.03762">arXiv:1706.03762</a></p>
  </li>
  <li>
    <p>Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021).
<em>RoFormer: Enhanced Transformer with Rotary Position Embedding.</em>
<a href="https://arxiv.org/abs/2104.09864">arXiv:2104.09864</a></p>
  </li>
  <li>
    <p>Biderman, S., Black, S., Foster, C., Gao, L., Hallahan, E., He, H.,
Wang, B., Wang, P. (2021). <em>Rotary Embeddings: A Relative Revolution.</em>
EleutherAI Blog.
<a href="https://blog.eleuther.ai/rotary-embeddings/">blog.eleuther.ai/rotary-embeddings</a></p>
  </li>
  <li>
    <p>Fleetwood. <em>You could have designed state of the art positional
encoding.</em>
<a href="https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding">fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding</a></p>
  </li>
</ol>]]></content><author><name>Naveen Reddy</name><email>naveenreddyvarikuti@gmail.com</email></author><summary type="html"><![CDATA[Binary encoding gave us the right idea but the wrong shape. Here we replace square waves with smooth ones to derive sinusoidal positional encoding from scratch, and build up to Rotary Position Embeddings (RoPE) the method behind LLaMA, Mistral, and Gemma.]]></summary></entry><entry><title type="html">Positional Encoding Explained: From Position to Binary Encoding (Part 1)</title><link href="https://naveenreddyvarikuti.github.io/2026/05/23/positional-encoding-transformers-explained.html" rel="alternate" type="text/html" title="Positional Encoding Explained: From Position to Binary Encoding (Part 1)" /><published>2026-05-23T08:00:00+00:00</published><updated>2026-05-23T08:00:00+00:00</updated><id>https://naveenreddyvarikuti.github.io/2026/05/23/positional-encoding-transformers-explained</id><content type="html" xml:base="https://naveenreddyvarikuti.github.io/2026/05/23/positional-encoding-transformers-explained.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>Language models process text as a sequence of tokens. While token embeddings can represent the meaning of individual words, they do not inherently represent where those words appear in the sequence.</p>

<p>For example, consider these two sentences:</p>

<div class="example-pair">
  <div class="example-box">Dog bites man</div>
  <div class="example-box">Man bites dog</div>
</div>

<p>The words are identical but their meanings are completely different because the order changed.</p>

<p>Positional encodings are techniques that allow Transformer models to incorporate information about token positions. They help the model distinguish between different token orders and reason about relationships between tokens across a sequence.</p>

<p>Over the years, several approaches have been proposed for representing positional information. Some are simple and intuitive, while others are mathematically elegant and widely used in modern large language models.</p>

<p>In this blog, we will build these ideas step by step, starting from the simplest possible approaches and gradually moving toward the methods used in modern Transformers.</p>

<p>We will cover:</p>

<ul>
  <li>Direct position values</li>
  <li>Normalized position representations</li>
  <li>Binary position encodings</li>
  <li>Sinusoidal positional embeddings</li>
  <li>Rotary Position Embeddings (RoPE)</li>
</ul>

<p>For each approach, we will understand the intuition behind it, see how it works mathematically, identify its limitations, and understand how those limitations naturally motivate the next idea.</p>

<p>By the end of this blog, you should have a solid understanding of how positional information is represented in Transformers, why different positional encoding methods were developed, and why modern language models rely heavily on techniques such as RoPE.</p>

<h2 id="the-bag-of-words-problem">The Bag of Words Problem</h2>

<p>Consider two sentences: “Dog bites man” and “Man bites dog.”</p>

<p>Same three words. Completely different meanings. A language model must
tell these apart.</p>

<p>A Transformer without positional encoding cannot distinguish between these two
sentences. The math makes it impossible.</p>

<h3 id="how-self-attention-computes-scores">How Self Attention Computes Scores</h3>

<p>Each token is first converted into an embedding vector. The model then projects
these embeddings into queries and keys using learned weight matrices \(W_q\) and
\(W_k\):</p>

\[Q_i = W_q \cdot e_i\]

\[K_j = W_k \cdot e_j\]

<p>The attention score between token \(i\) and token \(j\) is their dot product:</p>

\[\text{score}(i, j) = Q_i \cdot K_j = (W_q \cdot e_i) \cdot (W_k \cdot e_j)\]

<p>Based on the above formula, the score depends on the embedding of token \(i\) and the embedding of token \(j\).
Nothing else. The positions \(i\) and \(j\) themselves do not appear anywhere in the formula.</p>

<h3 id="why-order-becomes-invisible">Why Order Becomes Invisible</h3>

<p>The word “dog” gets the same embedding vector whether it appears at position 1
or position 3. So the attention score between “dog” and “bites” is identical
in both sentences.</p>

<p>In “Dog bites man”:</p>

\[\text{score}(\text{dog}, \text{bites}) = (W_q \cdot e_{\text{dog}}) \cdot (W_k \cdot e_{\text{bites}})\]

<p>In “Man bites dog”:</p>

\[\text{score}(\text{dog}, \text{bites}) = (W_q \cdot e_{\text{dog}}) \cdot (W_k \cdot e_{\text{bites}})\]

<p>Identical. This holds for every pair of tokens.</p>

<p>The full set of attention scores for “Dog bites man” is identical to the full set
for “Man bites dog.” The rows and columns are only reordered to follow the words. Every
score between a given pair of words is the same.</p>

<p><img src="/assets/animations/Positional_Encoding/BagOfWordsProblem.gif" alt="Bag of Words animation" /></p>

<h3 id="permutation-invariance">Permutation Invariance</h3>

<p>This property is called <strong>permutation invariance</strong>.</p>

<p>Shuffle the tokens in any order and the score between any two tokens does not change. “Dog
bites man” and “Man bites dog” and “Bites dog man” all produce the same
attention pattern, just reordered along with the tokens.</p>

<p>Embeddings are looked up by token identity, not by position. Position simply
does not exist in the computation.</p>

<p>Without a mechanism to inject position, the Transformer is a bag of words
model. It knows which words are present. It does not know where they are.</p>

<p>This is the problem that positional encodings exist to solve.</p>
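<p>The effect is easy to reproduce with random numbers. The sketch below assumes a toy three word vocabulary with random embeddings and random projection matrices. None of these values come from a real model.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

# Hypothetical toy vocabulary with random embeddings and projections
embed = {w: rng.normal(size=d_model) for w in ("dog", "bites", "man")}
W_q = rng.normal(size=(d_head, d_model))
W_k = rng.normal(size=(d_head, d_model))

def score(query_word, key_word):
    """Raw attention score: (W_q e_i) . (W_k e_j), with no position anywhere."""
    return (W_q @ embed[query_word]) @ (W_k @ embed[key_word])

def attention_scores(sentence):
    """Matrix of raw scores: row = query token, column = key token."""
    return np.array([[score(a, b) for b in sentence] for a in sentence])

s1 = attention_scores(["dog", "bites", "man"])
s2 = attention_scores(["man", "bites", "dog"])

# Putting sentence 2 back into sentence 1's word order recovers s1 exactly
perm = [2, 1, 0]
same_up_to_order = np.allclose(s1, s2[np.ix_(perm, perm)])

print(same_up_to_order)
print(s1[0, 1], s2[2, 1])   # score(dog, bites) in each sentence
```

<p>The score between “dog” and “bites” is the same number in both sentences. Nothing in the computation can tell the two word orders apart.</p>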

<h2 id="idea-1-adding-raw-position-numbers">Idea 1: Adding Raw Position Numbers</h2>

<p>The Transformer needs to know where each token is. The simplest idea is to just
tell it directly by adding its position.</p>

<p>Take the position of each token as an integer and add it to the embedding.
Token at position 0 gets +0. Token at position 1 gets +1. Token at position
511 gets +511.</p>

<p><img src="/assets/animations/Positional_Encoding/Idea1.png" alt="Idea1 illustration" /></p>

<h3 id="the-idea">The Idea</h3>

<p>Every token embedding is a vector of numbers, typically around the range of -1 to
+1. The proposal is straightforward: take the position
index and add it as a scalar to every dimension of the embedding vector.</p>

\[e_i' = e_i + i\]

<p>where $i$ is the position of the token in the sequence.</p>

<p>For a sequence “Dog bites man”:</p>

<ul>
  <li>Position 0: $e_{\text{dog}}' = e_{\text{dog}} + 0$</li>
  <li>Position 1: $e_{\text{bites}}' = e_{\text{bites}} + 1$</li>
  <li>Position 2: $e_{\text{man}}' = e_{\text{man}} + 2$</li>
</ul>

<p>Now “Dog bites man” and “Man bites dog” produce different embeddings because
the same word gets a different number added depending on where it appears.</p>

<p>But we will face an issue with this approach.</p>

<h3 id="the-scale-problem">The Scale Problem</h3>

<p>Embedding values typically live in the range of -1 to +1 (the exact range varies, and some values can be larger). These are small,
carefully learned numbers that encode the meaning of each token.</p>

<p>Now consider what happens at position 500. We add 500 to every dimension of
the embedding. A dimension that was 0.3 becomes 500.3. A dimension that was
-0.7 becomes 499.3.</p>

<p>The positional number completely dominates the embedding. The semantic content
of the token is overshadowed by a massive positional
value. The model can barely see the token’s embedding.</p>

<p>At position 0, the embedding is untouched. At position 500, the embedding is
almost entirely overwritten by the position value. Tokens near the beginning
of a sequence and tokens near the end live in completely different numerical
ranges, not only because they mean different things, but also because of where they
appear.</p>

<p><img src="/assets/animations/Positional_Encoding/ScaleMismatch.gif" alt="Scale Mismatch animation" /></p>
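<p>A tiny numerical sketch of the scale mismatch. The four embedding values are made up for illustration, and <code>add_raw_position</code> is a hypothetical helper that implements $e_i' = e_i + i$.</p>

```python
import numpy as np

embedding = np.array([0.3, -0.7, 0.1, 0.9])   # toy embedding values

def add_raw_position(e, position):
    """Add the position index to every dimension of the embedding."""
    return e + position

at_0 = add_raw_position(embedding, 0)
at_500 = add_raw_position(embedding, 500)

# How large the original embedding is compared with the shifted vector
share = np.linalg.norm(embedding) / np.linalg.norm(at_500)

print(at_0)
print(at_500)
print(share)
```

<p>At position 500 the original embedding accounts for only a tiny fraction of the vector’s size. The position value is almost all that is left.</p>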

<h3 id="no-upper-bound">No Upper Bound</h3>

<p>This approach has no fixed range. The position value grows without limit as
the sequence gets longer.</p>

<p>A model trained on sequences of length 512 has seen position values from 0 to
511. At inference time, if the input has 1024 tokens, the model suddenly sees
position values up to 1023. It has never encountered numbers this large during
training.</p>

<p>This is an out of distribution problem. The model has no way to generalize to
positions it has never seen.</p>

<h3 id="inconsistent-distance">Inconsistent Distance</h3>

<p>The absolute difference between position 1 and
position 2 is 1. The absolute difference between position 500 and position
501 is also 1.</p>

<p>But relative to the position values themselves, these gaps are very different.
Going from 1 to 2 doubles the value, while going from 500 to 501 changes it by
only 0.2 percent.</p>

<p>The model cannot learn a consistent notion of “adjacent positions” because the
same absolute gap of 1 looks completely different depending on where in the sequence
it occurs.</p>

<p>Three issues make raw integer positions unusable:</p>

<ul>
  <li><strong>Scale mismatch.</strong> Large position values drown out the semantic content of
embeddings. A token’s meaning becomes invisible behind its position number.</li>
  <li><strong>No upper bound.</strong> Position values grow without limit. The model cannot
generalize to sequence lengths it has not seen during training.</li>
  <li><strong>Inconsistent distances.</strong> The same gap between two positions looks
different depending on absolute position. The model cannot learn a uniform
sense of distance.</li>
</ul>

<p>The position values are unbounded and
live on a completely different scale than the embeddings.</p>

<p>What if we fix the scale problem by forcing all position values into a fixed
range?</p>

<h2 id="idea-2-normalized-positions">Idea 2: Normalized Positions</h2>

<p>Raw integer positions failed because the values were too large. They
overwhelmed the embeddings and had no upper bound.</p>

<p>Instead, can we just squeeze all position values into the range [0, 1]?</p>

<h3 id="the-idea-1">The Idea</h3>

<p>Divide each position by the length of the sequence minus one.</p>

\[PE(pos) = \frac{pos}{L - 1}\]

<p>where $L$ is the total number of tokens in the sequence.</p>

<p>For a sequence of length 512:</p>

<ul>
  <li>Position 0 → 0.0</li>
  <li>Position 255 → about 0.5</li>
  <li>Position 511 → 1.0</li>
</ul>

<p>Every position now maps to a value between 0 and 1. No matter how long the
sequence is, the values never exceed 1. They sit comfortably in the same
range as the embedding values.</p>

<p>The scale problem is gone. The model no longer has to deal with position
values like 500 drowning out embedding values like 0.3.</p>

<p>So this works?</p>

<h3 id="the-spacing-problem">The Spacing Problem</h3>

<p>Consider two sequences of different lengths.</p>

<p>A short sequence with 10 tokens:</p>

\[[0.0,\ 0.11,\ 0.22,\ 0.33,\ 0.44,\ 0.56,\ 0.67,\ 0.78,\ 0.89,\ 1.0]\]

<p>The spacing between consecutive positions is 0.11.</p>

<p>A long sequence with 1000 tokens:</p>

\[[0.0,\ 0.001,\ 0.002,\ 0.003,\ \dots,\ 0.999,\ 1.0]\]

<p>The spacing between consecutive positions is 0.001.</p>

<p>The gap between adjacent tokens is roughly 100 times smaller in the long sequence
than in the short sequence. Two tokens that are “one step apart” look very
different to the model depending on sequence length.</p>

<h3 id="same-position-different-values">Same Position, Different Values</h3>

<p>The same position index maps to completely different
values depending on the sequence length.</p>

<p>Position 5 in a 10 token sequence:</p>

\[PE(5) = \frac{5}{9} = 0.556\]

<p>Position 5 in a 1000 token sequence:</p>

\[PE(5) = \frac{5}{999} = 0.005\]

<p>The token at position 5 gets the value 0.556 in one case and 0.005 in the other.
These are not even close.</p>

<p>The model cannot learn what “position 5” means because the value it receives
changes with every input. A model trained mostly on short sequences will
associate 0.5 with the middle of a sentence. When it sees a long sequence
where 0.5 maps to position 500, the learned association breaks.</p>
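<p>Both problems show up directly when the formula is evaluated for two sequence lengths. This is a minimal sketch. The function name <code>normalized_position</code> is an assumption for the example.</p>

```python
import numpy as np

def normalized_position(pos, length):
    """PE(pos) = pos / (L - 1), where L is the sequence length."""
    return pos / (length - 1)

short = normalized_position(np.arange(10), 10)
long = normalized_position(np.arange(1000), 1000)

# Spacing between neighbouring positions depends on the sequence length
short_step = short[1] - short[0]
long_step = long[1] - long[0]

# The same position index maps to different values
pos5_short = normalized_position(5, 10)
pos5_long = normalized_position(5, 1000)

print(np.round(short, 2))
print(short_step, long_step)
print(pos5_short, pos5_long)
```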

<h3 id="why-this-is-fundamental">Why This Is Fundamental</h3>

<p>The root cause is that this scheme is <strong>relative to sequence length</strong>. It
does not encode absolute position. It encodes “how far through the sequence
are we.”</p>

<p>The value 0.0 always means “beginning.” The value 1.0 always means “end.” But
everything in between shifts depending on $L$.</p>

<p>This creates two failures:</p>

<ul>
  <li><strong>No consistent position identity.</strong> The same position index produces
different values for different sequence lengths. The model cannot learn
a stable representation for any position.</li>
  <li><strong>No consistent spacing.</strong> The distance between consecutive positions
depends on $L$. The model cannot learn a uniform notion of “adjacent
tokens” because the numerical gap changes per sequence.</li>
</ul>

<h3 id="what-we-need-instead">What We Need Instead</h3>

<p>Both attempts so far used a single number to represent each position. The
first attempt used numbers that were too large. The second attempt used
numbers that changed meaning depending on context length.</p>

<p>What if instead of a single number, we represented each position as a
vector? And what if that vector used a fixed, length independent pattern
that gave every position a unique and consistent representation?</p>

<p>This is exactly what binary encoding does.</p>

<h2 id="idea-3-binary-encoding">Idea 3: Binary Encoding</h2>

<p>Both previous approaches used a single number to represent each position.
That single number was either too large or too unstable across sequence
lengths.</p>

<p>A different idea: represent each position as a vector of bits.</p>

<h3 id="positions-as-binary-vectors">Positions as Binary Vectors</h3>

<p>Every integer can be written in binary. We can use this binary representation
directly as a position encoding vector.</p>

<p>For a 9 bit encoding:</p>

<table>
  <thead>
    <tr>
      <th>Position</th>
      <th>Binary Vector</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0]</td>
    </tr>
    <tr>
      <td>1</td>
      <td>[0, 0, 0, 0, 0, 0, 0, 0, 1]</td>
    </tr>
    <tr>
      <td>2</td>
      <td>[0, 0, 0, 0, 0, 0, 0, 1, 0]</td>
    </tr>
    <tr>
      <td>5</td>
      <td>[0, 0, 0, 0, 0, 0, 1, 0, 1]</td>
    </tr>
    <tr>
      <td>255</td>
      <td>[0, 1, 1, 1, 1, 1, 1, 1, 1]</td>
    </tr>
    <tr>
      <td>511</td>
      <td>[1, 1, 1, 1, 1, 1, 1, 1, 1]</td>
    </tr>
  </tbody>
</table>

<p>Each position gets a unique vector of 0s and 1s. The dimensionality of the
vector is $\lceil \log_2(L) \rceil$, where $L$ is the maximum sequence
length. For a sequence of up to 512 tokens, we need 9 bits.</p>
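
<p>As a rough sketch, the table above can be generated with a few lines of Python. The function name <code>binary_position_vector</code> is hypothetical, and the most significant bit is placed first to match the table.</p>

```python
def binary_position_vector(pos: int, n_bits: int) -> list[int]:
    """Return pos as a list of n_bits bits, most significant bit first."""
    if not 0 <= pos < 2 ** n_bits:
        raise ValueError(f"position {pos} does not fit in {n_bits} bits")
    # Shift the bit of interest into the lowest place, then mask it out.
    return [(pos >> i) & 1 for i in reversed(range(n_bits))]


for pos in (0, 1, 2, 5, 255, 511):
    print(pos, binary_position_vector(pos, 9))
```

<p>The function never looks at the sequence length. It only needs the position and the number of bits.</p>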

<h3 id="what-binary-encoding-gets-right">What Binary Encoding Gets Right</h3>

<p>This approach fixes every problem from the previous two attempts.</p>

<p><strong>Bounded values.</strong> Every entry in the vector is either 0 or 1. No position
value ever exceeds 1. There is no risk of drowning out the embedding.</p>

<p><strong>Unique per position.</strong> Every integer has a distinct binary representation.
No two positions share the same vector. Position 5 is always
[0, 0, 0, 0, 0, 0, 1, 0, 1], regardless of how long the sequence is.</p>

<p><strong>Length independent.</strong> Unlike normalized positions, the encoding of position
5 does not change when the sequence length changes. Position 5 is the same
vector whether the sequence has 10 tokens or 10,000 tokens.</p>

<p><strong>Fixed dimensionality.</strong> The encoding uses $\lceil \log_2(L) \rceil$
dimensions. This grows very slowly. 10 bits can handle sequences up to 1024.
20 bits can handle sequences up to 1,048,576.</p>
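
<p>The bit counts quoted above can be checked with a small helper. This is a sketch with a hypothetical name, <code>bits_needed</code>, and it uses integer arithmetic instead of a floating point logarithm to avoid rounding surprises.</p>

```python
def bits_needed(max_len: int) -> int:
    """Number of bits needed to give positions 0 .. max_len - 1 distinct codes."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    # (max_len - 1).bit_length() equals ceil(log2(max_len)) for max_len >= 1,
    # computed exactly with integers.
    return (max_len - 1).bit_length()


for max_len in (512, 1024, 1_048_576):
    print(max_len, bits_needed(max_len))
```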

<p>But something interesting is hidden in how these bits change across
positions. Before we look at the problems, let us first look at the
structure.</p>

<h3 id="the-frequency-pattern-in-binary">The Frequency Pattern in Binary</h3>

<p>Here are the binary representations for positions 0 through 7, with each
bit shown in its own column.</p>

<table>
  <thead>
    <tr>
      <th>Position</th>
      <th>Bit 2 ($2^2$)</th>
      <th>Bit 1 ($2^1$)</th>
      <th>Bit 0 ($2^0$)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <td>2</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <td>3</td>
      <td>0</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <td>4</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>5</td>
      <td>1</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <td>6</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <td>7</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p>Now read each column from top to bottom.</p>

<p><strong>Bit 0</strong> (the rightmost, least significant bit) flips every single position:
0, 1, 0, 1, 0, 1, 0, 1. It completes a full cycle every 2 positions.</p>

<p><strong>Bit 1</strong> flips every 2 positions: 0, 0, 1, 1, 0, 0, 1, 1. It completes a
full cycle every 4 positions.</p>

<p><strong>Bit 2</strong> flips every 4 positions: 0, 0, 0, 0, 1, 1, 1, 1. It completes a
full cycle every 8 positions.</p>

<p>Each column is a square wave, and the frequency of that wave depends on which bit position it is.</p>
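
<p>To see the three columns without reading the table by hand, here is a minimal sketch that pulls out each bit across positions 0 through 7. The variable names are illustrative only.</p>

```python
N_BITS = 3
positions = range(2 ** N_BITS)  # positions 0 .. 7

# columns[i] holds the value of bit i at every position, in order.
columns = {i: [(p >> i) & 1 for p in positions] for i in range(N_BITS)}

for i in range(N_BITS):
    print(f"bit {i}: {columns[i]}")
```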

<p><img src="/assets/animations/Positional_Encoding/Binary_Encoding_Frequency_pattern.png" alt="Binary Frequency Pattern" /></p>

<h3 id="lsb-vs-msb-fast-bits-and-slow-bits">LSB vs MSB: Fast Bits and Slow Bits</h3>

<p>This pattern generalizes to any number of bits. For bit position $i$
(counting from the right, starting at 0):</p>

\[\text{Oscillation period of bit } i = 2^{i+1} \text{ positions}\]

<p>The <strong>least significant bit</strong> (LSB, rightmost, $i = 0$) oscillates the
fastest. It flips at every single position. It has a period of 2.</p>

<p>The <strong>most significant bit</strong> (MSB, leftmost, $i = d-1$ in a $d$ bit encoding) oscillates the
slowest. For a 9 bit encoding, it flips every 256 positions. It has a period
of 512.</p>

<p>Bits on the right change rapidly. Bits on the left change
slowly. Each bit position captures positional information at a different
scale.</p>
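
<p>The period formula can be written out explicitly. The symbol $b_i(p)$ is notation introduced here for the value of bit $i$ at position $p$. It is not used elsewhere in this series.</p>

\[b_i(p) = \left\lfloor \frac{p}{2^i} \right\rfloor \bmod 2\]

<p>Adding $2^{i+1}$ to $p$ increases the floor term by exactly 2, which leaves the result modulo 2 unchanged:</p>

\[b_i(p + 2^{i+1}) = \left( \left\lfloor \frac{p}{2^i} \right\rfloor + 2 \right) \bmod 2 = b_i(p)\]

<p>As a check, position 5 gives $b_0(5) = 5 \bmod 2 = 1$, $b_1(5) = 2 \bmod 2 = 0$, and $b_2(5) = 1 \bmod 2 = 1$, which reads as 101 from the most significant bit down.</p>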

<table>
  <thead>
    <tr>
      <th>Bit Position</th>
      <th>Flips Every</th>
      <th>Period</th>
      <th>Role</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bit 0 (LSB)</td>
      <td>1 position</td>
      <td>2</td>
      <td>Finest grain, changes constantly</td>
    </tr>
    <tr>
      <td>Bit 1</td>
      <td>2 positions</td>
      <td>4</td>
      <td> </td>
    </tr>
    <tr>
      <td>Bit 2</td>
      <td>4 positions</td>
      <td>8</td>
      <td> </td>
    </tr>
    <tr>
      <td>Bit 3</td>
      <td>8 positions</td>
      <td>16</td>
      <td> </td>
    </tr>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td> </td>
    </tr>
    <tr>
      <td>Bit 8 (MSB)</td>
      <td>256 positions</td>
      <td>512</td>
      <td>Coarsest grain, barely changes</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/animations/Positional_Encoding/oscilation_frequency_by_bit.png" alt="Oscillation Frequency" /></p>

<h3 id="visualizing-the-square-waves">Visualizing the Square Waves</h3>

<p>If you plot the value of each bit across all positions, you see a series of
square waves stacked on top of each other. Each wave has exactly half the
frequency of the one below it.</p>

<p><img src="/assets/animations/Positional_Encoding/square_wave_patterns.png" alt="Square Wave Patterns" /></p>

<p>This is a multi frequency encoding. The lowest bit gives fine grained
position information (is this an even or odd position?). The highest bit
gives coarse position information (are we in the first half or second half
of the sequence?).</p>
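
<p>One way to check the frequency halving numerically is to count how often each bit flips across all 512 positions of a 9 bit encoding. This is only a sketch, and <code>square_wave</code> and <code>count_flips</code> are hypothetical helper names.</p>

```python
N_BITS = 9
N_POSITIONS = 2 ** N_BITS  # 512 positions, 0 .. 511


def square_wave(bit: int) -> list[int]:
    """Value of one bit at every position: a square wave over position."""
    return [(p >> bit) & 1 for p in range(N_POSITIONS)]


def count_flips(wave: list[int]) -> int:
    """Count how many times the wave changes value between neighbours."""
    return sum(a != b for a, b in zip(wave, wave[1:]))


flips = [count_flips(square_wave(bit)) for bit in range(N_BITS)]
print(flips)
```

<p>Bit 0 flips 511 times and bit 8 flips exactly once, at the midpoint of the range. Each step up in bit position roughly halves the number of flips.</p>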

<p>This multi frequency structure is the most important observation
about binary encoding. It will directly motivate sinusoidal positional
encoding in the next post.</p>

<h3 id="the-discontinuity-problem">The Discontinuity Problem</h3>

<p>Despite this awesome frequency structure, binary encoding has a flaw.</p>

<p>Look at positions 3 and 4:</p>

<ul>
  <li>Position 3: [0, 1, 1]</li>
  <li>Position 4: [1, 0, 0]</li>
</ul>

<p>These two positions are adjacent. They are one step apart. But their binary
vectors differ in all three bits, a Hamming distance of 3. That is the largest
distance possible between two 3 bit vectors.</p>

<p>Now look at positions 2 and 3:</p>

<ul>
  <li>Position 2: [0, 1, 0]</li>
  <li>Position 3: [0, 1, 1]</li>
</ul>

<p>Also adjacent. Also one step apart. But only one bit differs, a Hamming
distance of 1. That is the smallest distance possible between two distinct vectors.</p>

<p><img src="/assets/animations/Positional_Encoding/BinaryDiscontinuity_ManimCE.gif" alt="Discontinuity Problem" /></p>

<p>Adjacent positions have wildly inconsistent distances in the encoding space.
The transition from 3 to 4 is a large jump. The transition from 2 to 3 is
a tiny step. There is no smooth relationship between position and encoding.</p>

<p>This happens because binary numbers carry over. When all lower bits are 1,
the next increment flips them all to 0 and flips the next higher bit to 1.
These carry overs cause sudden large changes in the vector for what should
be a small step in position.</p>
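
<p>The uneven step sizes are easy to measure. The sketch below counts how many bits differ between each position and the next one, for positions 0 through 7. The helper name <code>hamming_distance</code> is an illustration, not part of any library.</p>

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the binary codes of a and b differ."""
    # XOR leaves a 1 exactly where the two codes disagree.
    return bin(a ^ b).count("1")


# Distance between each position and its right-hand neighbour, for 0 .. 7.
steps = [hamming_distance(p, p + 1) for p in range(7)]
print(steps)
```

<p>Every step in position is the same size, yet the step in encoding space ranges from 1 to 3 bits. The largest value sits at the 3 to 4 transition, where the carry runs through every lower bit. For 0/1 vectors the Euclidean distance is the square root of this count, so it is uneven in the same way.</p>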

<h3 id="why-discontinuity-matters">Why Discontinuity Matters</h3>

<p>Neural networks tend to learn smooth functions. They work best when small changes
in input produce small changes in output. If two positions are close
together, their encodings should also be close together.</p>

<p>Binary encoding violates this. The model cannot learn a smooth notion of
“nearby positions” because the size of the jump between adjacent positions
changes from one step to the next.</p>

<h3 id="what-we-keep-what-we-fix">What We Keep, What We Fix</h3>

<p>Binary encoding gave us two valuable ideas:</p>

<ul>
  <li><strong>Multi frequency structure.</strong> Different bits capture position at different
scales. Fast bits for fine detail, slow bits for coarse structure.</li>
  <li><strong>Vector representation.</strong> Each position is a vector, not a single number.</li>
</ul>

<p>But it also has one critical issue:</p>

<ul>
  <li><strong>Discrete jumps.</strong> The square wave transitions between 0 and 1 are
discontinuous. Adjacent positions can have very different encodings.</li>
</ul>

<p>The fix is simple. Replace the square waves with smooth waves.
Replace the discrete 0/1 flips with continuous sine and cosine functions.</p>

<p>Keep the multi frequency structure. Make it smooth.</p>
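
<p>As a rough illustration of why smooth waves help, the toy sketch below keeps the binary periods $2^{i+1}$ but swaps each square wave for a sine and cosine pair. This is an assumption made for demonstration only. It is not the sinusoidal encoding from the Transformer paper, which chooses its frequencies differently, and <code>smooth_encoding</code> is a hypothetical name.</p>

```python
import math

N_BITS = 3


def smooth_encoding(pos: int) -> list[float]:
    """Toy smooth code: one (sin, cos) pair per binary period 2 ** (i + 1)."""
    vec = []
    for i in range(N_BITS):
        period = 2 ** (i + 1)  # same period as bit i of the binary code
        angle = 2 * math.pi * pos / period
        vec += [math.sin(angle), math.cos(angle)]
    return vec


def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance between each position and its right-hand neighbour, for 0 .. 7.
steps = [euclidean(smooth_encoding(p), smooth_encoding(p + 1)) for p in range(7)]
print([round(s, 3) for s in steps])
```

<p>In this toy version every adjacent pair of positions is the same distance apart, because each sine and cosine pair rotates by a fixed angle per step. The uneven jumps of the binary code are gone while the multi frequency structure stays.</p>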

<p>This is exactly what sinusoidal positional encoding does. Let's discuss it in the next post!</p>

<p><a href="https://naveenreddyvarikuti.github.io/2026/06/06/positional-encoding-sinusoidal-and-rope.html">Part 2</a> of this series covers sinusoidal positional encoding and
Rotary Position Embeddings (RoPE), the method used in nearly every modern
large language model, including LLaMA, Mistral, and Gemma.</p>

<h2 id="references">References</h2>

<ol>
  <li>
    <p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, L., Polosukhin, I. (2017). <em>Attention Is All You Need.</em>
<a href="https://arxiv.org/pdf/1706.03762">arXiv:1706.03762</a></p>
  </li>
  <li>
    <p>Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021).
<em>RoFormer: Enhanced Transformer with Rotary Position Embedding.</em>
<a href="https://arxiv.org/abs/2104.09864">arXiv:2104.09864</a></p>
  </li>
  <li>
    <p>Biderman, S., Black, S., Foster, C., Gao, L., Hallahan, E., He, H.,
Wang, B., Wang, P. (2021). <em>Rotary Embeddings: A Relative Revolution.</em>
EleutherAI Blog.
<a href="https://blog.eleuther.ai/rotary-embeddings/">blog.eleuther.ai/rotary-embeddings</a></p>
  </li>
  <li>
    <p>Fleetwood. <em>You could have designed state of the art positional
encoding.</em>
<a href="https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding">fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding</a></p>
  </li>
</ol>]]></content><author><name>Naveen Reddy</name><email>naveenreddyvarikuti@gmail.com</email></author><summary type="html"><![CDATA[Language models process text as a sequence of tokens. While token embeddings can represent the meaning of individual words, they do not inherently represent where those words appear in the sequence.]]></summary></entry></feed>