Indexing a book using bash

Save yourself work by making the process of indexing more efficient

Photo by Sebastian Pena Lambarri on Unsplash

Making an Index: How?

A second\index{second} is one of the fundamental units of time.

The obvious solution: go over manually

Replace All with a GUI?

Enter bash: find and replace with sed

A \textbf{second} is one of the fundamental units of time.
There are 60 seconds for each \textbf{minute}.
Beneath the level of the second is the \textbf{millisecond}, the \textbf{nanosecond} and the \textbf{microsecond}.
Second, the term, was borrowed into English from Old French \cite{wiktionary}.
Second to none, the second is the easiest unit to count with.
sed -e 's/second/hour/g' chapter_1.tex
Find and replace with sed
sed -i 's/second/hour/g' chapter_1.tex
screenshot showing second replaced by hour in the file
Note the differing outputs of running cat chapter_1.tex before and after sed.
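To see the difference between the two commands without touching a real chapter, here is a minimal sketch on a scratch file. The temporary file and the variable names are made up for illustration, and the `-i` form shown assumes GNU sed (BSD/macOS sed expects a suffix argument after `-i`).

```shell
# Hypothetical scratch file standing in for chapter_1.tex
tmp=$(mktemp)
printf '%s\n' 'A \textbf{second} is one of the fundamental units of time.' > "$tmp"

# Without -i: the edited text goes to stdout and the file stays as it was
preview=$(sed -e 's/second/hour/g' "$tmp")
unchanged=$(cat "$tmp")

# With -i (GNU sed syntax): the file itself is rewritten
sed -i 's/second/hour/g' "$tmp"
changed=$(cat "$tmp")

printf 'preview: %s\nbefore:  %s\nafter:   %s\n' "$preview" "$unchanged" "$changed"
rm -f "$tmp"
```

The `before` line still contains `second`, while `preview` and `after` both contain `hour`.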
screenshot showing second and Second replaced by hour in the file
Using a basic regular expression with sed
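The command behind this step is not reproduced in the text, so the following is only a sketch of one way to catch both `second` and `Second`: a bracket expression for the first letter. The sample line is taken from the chapter above; the variable names are illustrative.

```shell
line='Second to none, the second is the easiest unit to count with.'

# Plain pattern: only the lowercase occurrence matches
plain=$(printf '%s\n' "$line" | sed -e 's/second/hour/g')

# Bracket expression: [sS] matches either case of the first letter
both=$(printf '%s\n' "$line" | sed -e 's/[sS]econd/hour/g')

printf '%s\n%s\n' "$plain" "$both"
```

The first result leaves the capitalised `Second` alone; the second replaces both occurrences.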

Making a bash function for indexing

## to be revised…
function index() {
    # $1 = text to find, $2 = replacement, $3 = file to read
    sed -e 's/'"$1"'/'"$2"'/g' "$3"
}
First attempt at the index function
Nice!

Looping over files

## to be revised..
function index() {
    for FILE in *.tex
    do
        # Quote the file name so names containing spaces still work
        sed -e 's/'"$1"'/'"$2"'/g' "$FILE"
    done
}
A \textbf{second} after he left, the phone rang.
He would have only had time to talk for a \textbf{minute} anyway.
But obviously he could talk for more than a \textbf{millisecond}, a \textbf{nanosecond} and a \textbf{microsecond}.
Second to none, the second he left was the biggest regret of his life.
running our preliminary index function over two chapters. The word ‘second’ in both files has been replaced by ‘hour’ in the
Now we’re looping
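One thing worth knowing about a `for FILE in *.tex` loop: if no file matches, the shell hands the loop the literal string `*.tex`. The sketch below shows a small guard against that. The helper name `replace_in_dir` and the scratch directories are hypothetical, used only so the example can run anywhere.

```shell
# Hypothetical scratch directories: one with two chapters, one empty
workdir=$(mktemp -d)
emptydir=$(mktemp -d)
printf '%s\n' 'A \textbf{second} is a unit of time.' > "$workdir/chapter_1.tex"
printf '%s\n' 'A \textbf{second} after he left, the phone rang.' > "$workdir/chapter_2.tex"

# Hypothetical helper: run the same substitution over every .tex file in a directory
replace_in_dir() {
    for FILE in "$1"/*.tex
    do
        # With no .tex files the glob stays a literal pattern; skip it
        [ -e "$FILE" ] || continue
        sed -e 's/second/hour/g' "$FILE"
    done
}

with_files=$(replace_in_dir "$workdir")
without_files=$(replace_in_dir "$emptydir" 2>&1)

printf '%s\n' "$with_files"
rm -rf "$workdir" "$emptydir"
```

In the empty directory the loop body never reaches sed, so there is no "No such file" error.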

Making the actual index function

## to be revised..
function index() {
    for FILE in *.tex
    do
        # Reuses the query ($1) as the visible text, then appends \index{$2}
        sed -e 's/'"$1"'/'"$1"'\\index\{'"$2"'\}/g' "$FILE"
    done
}
using the query as part of the replacement fails with regular expressions
This is clearly not what we want.
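The reason is that sed treats the replacement side as plain text, so a query such as `[sS]econd` is pasted in literally instead of the word it matched. A minimal sketch of the failure, with illustrative variable names:

```shell
query='[sS]econd'
entry='second'
line='Second to none, the second is the easiest unit to count with.'

# The query appears on both sides of the substitution, as in the function above
out=$(printf '%s\n' "$line" | sed -e 's/'"$query"'/'"$query"'\\index{'"$entry"'}/g')

printf '%s\n' "$out"
# prints: [sS]econd\index{second} to none, the [sS]econd\index{second} is the easiest unit to count with.
```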
## to be revised..
function index() {
    for FILE in *.tex
    do
        # \( ... \) captures whatever the query matched; \1 puts it back unchanged
        sed -e 's/\('"$1"'\)/\1\\index\{'"$2"'\}/g' "$FILE"
    done
}
Using the query as part of the replacement gets the right result with regular expressions once the match is captured in a group
adding in a word boundary in the query means we don’t index microsecond, for instance
We’re no longer indexing, e.g., ‘microsecond’ under ‘second’.
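Here is a small side-by-side sketch of the same substitution with and without the word boundary. The sample line is made up for the comparison, and `\b` is a GNU sed extension, so this assumes GNU sed.

```shell
line='The \textbf{second} is longer than a \textbf{microsecond}.'

# No word boundary: the tail of "microsecond" matches as well
loose=$(printf '%s\n' "$line" | sed -e 's/\([sS]econd\)/\1\\index{second}/g')

# With \b the match has to start at the beginning of a word
strict=$(printf '%s\n' "$line" | sed -e 's/\(\b[sS]econd\)/\1\\index{second}/g')

printf '%s\n%s\n' "$loose" "$strict"
```

Only the first result tags `microsecond`; the second leaves it untouched.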

Adding in checking

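## to be revised..
function index() {
    for FILE in *.tex
    do
        # grep prints only the matching lines, prefixed with file name and line number.
        # No --color=always here: grep's color escape codes would end up directly in
        # front of the match and can stop a \b query from matching in sed.
        grep -nH "$1" "$FILE" | sed -e 's/\('"$1"'\)/\1\\index\{'"$2"'\}/g'
    done
}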
output showing first filtering the query through grep, with file and line numbers
Note the use of the single quotes in the query when we run the command. That’s important for `grep` to correctly identify `\b`.
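A quick way to see what the quotes change is to print the argument exactly as a command would receive it. This sketch uses `printf` as a stand-in for the function call; the variable names are illustrative.

```shell
# Unquoted: the shell removes the backslash before the command ever sees it
unquoted=$(printf '%s\n' \bsecond)

# Single-quoted: the backslash is passed through untouched
quoted=$(printf '%s\n' '\bsecond')

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
# prints:
# unquoted: bsecond
# quoted:   \bsecond
```

Without the quotes, the query would silently turn into a search for the literal text `bsecond`.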
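## Final version!
function index() {
    ARG1=$1
    ARG2=$2
    # Preview: show every matching line as it would look after indexing.
    # grep runs without --color=always so no escape codes sit in front of the match.
    for FILE in *.tex
    do
        grep -nH "$ARG1" "$FILE" | sed -e 's/\('"$ARG1"'\)/\1\\index\{'"$ARG2"'\}/g'
    done
    read -p 'Is this correct (y/n): ' VALIDATION
    # Quoting the expansion keeps the test valid even when the answer is empty
    if [ "${VALIDATION::1}" == Y ] || [ "${VALIDATION::1}" == y ]
    then
        echo "Applying changes to file."
        for SOURCE in *.tex
        do
            # -i rewrites each file in place
            sed -i 's/\('"$ARG1"'\)/\1\\index\{'"$ARG2"'\}/g' "$SOURCE"
        done
    else
        echo "Function aborted, go and refine the query."
    fi
}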
index '\b[sS]econd' second
Running the final function and making the changes in file
Running the full function. Note again the second `cat` output, where the changes have been made in-file.
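The confirmation step can also be written with a `case` statement, which accepts anything starting with `y` or `Y` and copes with an empty answer. This is only an alternative sketch, not the version used above; the helper name `is_yes` is hypothetical.

```shell
# Hypothetical helper: succeed when the answer starts with y or Y
is_yes() {
    case "$1" in
        [yY]*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Try a few answers, including an empty one (just pressing Enter)
for answer in y Yes n ''
do
    if is_yes "$answer"; then
        echo "'$answer' -> apply changes"
    else
        echo "'$answer' -> abort"
    fi
done
```

Inside the function this would replace the `if [ ... ] || [ ... ]` line with `if is_yes "$VALIDATION"`.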