Imperative to compose example

Changing a piece of imperative code to be purely functional

While happily coding my ggit project I wrote the following function that returns a list of filenames that have been changed (according to Git version control).

// src/changed-files.js
// assuming _ is lodash here (underscore offers the same groupBy)
var _ = require('lodash');
var log = require('debug')('ggit');
// exec is a promise-returning shell command runner
var exec = require('./exec');

// splits one "<letter>\t<filename>" line into its parts
function parseLine(line) {
  var parts = line.split('\t');
  return {
    diff: parts[0],
    name: parts[1]
  };
}

function groupByModification(parsedLines) {
  return _.groupBy(parsedLines, 'diff');
}

function changedFiles() {
  var cmd = 'git diff --name-status --diff-filter=ACDM';
  log('filter letters Added (A), Copied (C), Deleted (D), Modified (M)');
  log('changed files command', cmd);
  return exec(cmd)
    .then(function (data) {
      data = data.trim();
      var files = data.split('\n');
      // drop empty lines, then parse each remaining line
      files = files.filter(function (filename) {
        return filename.length;
      }).map(parseLine);

      log('found changed files');
      log(files);

      var grouped = groupByModification(files);
      log('grouped by modification');
      log(grouped);
      return grouped;
    });
}

The code runs the git command git diff --name-status --diff-filter=ACDM, which prints each changed filename prefixed with its modification letter. For example, if you modified the source file src/foo.js and added the file README.md, the command output looks something like this (\t stands for a tab character)

M\tsrc/foo.js
A\tREADME.md

The function returns a promise that resolves with a single object like this

{
  A: [...], // list of added files
  C: [...], // list of copied files
  M: [...], // list of modified files
  D: [...] // list of deleted files
}
// each item in the list is
{
  diff: 'A', // or C, M, D
  name: 'src/something.js' // relative to the repo root
}
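
To make that shape concrete, here is a small self-contained sketch that runs the parsing and grouping on the sample output shown earlier. The groupBy helper is a hand-written stand-in for _.groupBy so the snippet runs without dependencies, and sampleOutput is made up for illustration; this is not the actual ggit code.

```javascript
// sample stdout, shaped like the output of git diff --name-status
var sampleOutput = 'M\tsrc/foo.js\nA\tREADME.md\n';

function parseLine(line) {
  var parts = line.split('\t');
  return { diff: parts[0], name: parts[1] };
}

// hypothetical stand-in for _.groupBy(list, key)
function groupBy(list, key) {
  return list.reduce(function (groups, item) {
    var k = item[key];
    (groups[k] = groups[k] || []).push(item);
    return groups;
  }, {});
}

var grouped = groupBy(
  sampleOutput.trim().split('\n').map(parseLine),
  'diff'
);
console.log(JSON.stringify(grouped));
// {"M":[{"diff":"M","name":"src/foo.js"}],"A":[{"diff":"A","name":"README.md"}]}
```

Note that letters with no matching files, here C and D, are simply missing from the result rather than mapped to empty lists.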

The changedFiles function essentially runs the following 4 steps

1: parse git stdout output
2: run the debug log command
3: group the files by modification
4: debug print the group

The final group is returned by the function. That is it. Yet this code is very error-prone: we need to keep track of the local variable (files), the input argument (data), and the return variable (grouped).

When there are 4 steps and 3 variables, any step can touch any variable, so the number of possible interactions goes up to 12 (= 4 * 3). This is too much for the human mind to keep track of; short-term memory can only "cache" roughly 4 to 7 things at once.

Can we refactor this code to eliminate the variables? Yes! We are going to use functional composition to eliminate local variables and make the data flow stricter. We are also going to factor out individual pure functions that only work on the input arguments, making reasoning about them much simpler.

step 1

Factor every little data processing into its own function

We have 4 steps in this computation. Let us split it into 4 functions

function parseLine(line) {
  var parts = line.split('\t');
  return {
    diff: parts[0],
    name: parts[1]
  };
}

function parseOutput(data) {
  data = data.trim();
  var files = data.split('\n');
  // drop empty lines, then parse each remaining line
  files = files.filter(function (filename) {
    return filename.length;
  }).map(parseLine);
  return files;
}

function groupByModification(parsedLines) {
  return _.groupBy(parsedLines, 'diff');
}

function logFoundFiles(files) {
  log('found changed files');
  log(files);
}

function logGroupedFiles(grouped) {
  log('grouped by modification');
  log(grouped);
}

function changedFiles() {
  var cmd = 'git diff --name-status --diff-filter=ACDM';
  log('filter letters Added (A), Copied (C), Deleted (D), Modified (M)');
  log('changed files command', cmd);
  return exec(cmd)
    .then(function outputToGroup(data) {
      var files = parseOutput(data);
      logFoundFiles(files);
      var grouped = groupByModification(files);
      logGroupedFiles(grouped);
      return grouped;
    });
}

Excellent, each small function is easy to reason about in isolation. We could also move them into a separate file and quickly unit test them. The main block of code we are going to refactor is now very clear. I like giving functions names, even the callback passed to .then; it makes debugging crashes much simpler because the name shows up in the stack trace.

function outputToGroup(data) {
  var files = parseOutput(data);
  logFoundFiles(files);
  var grouped = groupByModification(files);
  logGroupedFiles(grouped);
  return grouped;
}
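
As a rough illustration of how little is needed to test the small functions from step 1, the sketch below copies parseLine and parseOutput (slightly condensed) and checks them with Node's built-in assert module. The sample strings and the nodeAssert variable name are invented for this example.

```javascript
var nodeAssert = require('assert');

function parseLine(line) {
  var parts = line.split('\t');
  return { diff: parts[0], name: parts[1] };
}

function parseOutput(data) {
  return data.trim().split('\n')
    .filter(function (line) { return line.length; })
    .map(parseLine);
}

// a single modified file
nodeAssert.deepStrictEqual(parseOutput('M\tfoo.js\n'), [
  { diff: 'M', name: 'foo.js' }
]);
// empty git output means no changed files
nodeAssert.deepStrictEqual(parseOutput(''), []);
// the trailing newline should not produce an extra entry
nodeAssert.strictEqual(parseOutput('A\ta.js\nD\tb.js\n').length, 2);
```

No git repository, no promises and no mocks are needed, because these functions only depend on their input argument.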

step 2

Replace imperative steps with composition

The output of each function call in outputToGroup is fed into the next function (except for the log calls). Imagine for a second that the functions logFoundFiles and logGroupedFiles returned their first argument unchanged. Then the function outputToGroup could be written like this

function outputToGroup(data) {
  return logGroupedFiles(groupByModification(logFoundFiles(parseOutput(data))));
}

We don't even need to write the function outputToGroup by hand - we could make this "caterpillar" on the fly using a utility from any functional library: lodash.compose (available as flowRight in lodash 4, where the compose alias was removed) or ramda.compose.

var R = require('ramda');
var outputToGroup = R.compose(logGroupedFiles, groupByModification, logFoundFiles, parseOutput);
function changedFiles() {
  var cmd = 'git diff --name-status --diff-filter=ACDM';
  log('filter letters Added (A), Copied (C), Deleted (D), Modified (M)');
  log('changed files command', cmd);
  return exec(cmd)
    .then(outputToGroup);
}

That is very cool and literally eliminates any place in the code for an error to hide. We only have a single problem: functions logGroupedFiles and logFoundFiles do NOT return the first argument, thus we cannot use them inside compose: they stop the flow of data!
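
To see the flow of data stop, here is a sketch with a hand-written compose (a simplified stand-in for R.compose that only handles one-argument functions) and three made-up helper functions: double, logValue and addOne. The logging step sits in the middle and returns nothing, so the step after it receives undefined.

```javascript
// simplified stand-in for R.compose: applies functions right to left
function compose() {
  var fns = Array.prototype.slice.call(arguments);
  return function (x) {
    return fns.reduceRight(function (acc, fn) {
      return fn(acc);
    }, x);
  };
}

function double(n) { return n * 2; }
function addOne(n) { return n + 1; }
// logs its argument but, like logFoundFiles, returns undefined
function logValue(n) { console.log('value', n); }

var working = compose(addOne, double);
console.log(working(5)); // 11

var broken = compose(addOne, logValue, double);
var result = broken(5);  // logs "value 10"
console.log(result);     // NaN, because addOne received undefined
```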

Luckily, it is simple to work around this problem. We can adapt a function on the fly by creating a new function that calls the original function but DOES return its argument. This adapter is called tap, and one can either implement tap by hand or use a 3rd party utility, like ramda.tap

logFoundFiles('foo'); // prints "foo", returns undefined
R.tap(logFoundFiles)('foo'); // prints "foo", returns 'foo'
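
If you would rather not pull in a library, a hand-written tap takes only a few lines. This is a minimal sketch for one-argument functions, not Ramda's actual implementation, and logFoundFiles here is a simplified version that prints with console.log instead of the debug module.

```javascript
// wraps fn so that it is called for its side effect only
function tap(fn) {
  return function (x) {
    fn(x);    // run the original function
    return x; // ignore its result and pass the argument through
  };
}

// simplified logger: returns undefined, like the original
function logFoundFiles(files) {
  console.log('found changed files', files);
}

var plain = logFoundFiles('foo');       // undefined
var tapped = tap(logFoundFiles)('foo'); // 'foo'
```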

We just need to add taps around any function in our composition that should be "ignored", and whose original arguments should be just passed to the next step

var R = require('ramda');
var outputToGroup = R.compose(
  R.tap(logGroupedFiles), groupByModification, R.tap(logFoundFiles), parseOutput
);

I indented the code a little differently for clarity. In my personal opinion it is even better to use ramda.pipe, which is equivalent to R.compose except that the functions are listed in reverse order. R.pipe reads from left to right (or from top to bottom when each function is on its own line), which feels very natural because the imperative code it replaces also runs from top to bottom.

// original imperative code from top to bottom
function outputToGroup(data) {
  var files = parseOutput(data);
  logFoundFiles(files);
  var grouped = groupByModification(files);
  logGroupedFiles(grouped);
  return grouped;
}
// equivalent functional code using R.pipe
var R = require('ramda');
var outputToGroup = R.pipe(
  parseOutput,
  R.tap(logFoundFiles),
  groupByModification,
  R.tap(logGroupedFiles)
);

The functional code lists the functions in the same order in which the original imperative code runs them, but the flow of data is managed for us. The output of parseOutput is fed to logFoundFiles and, because R.tap ignores whatever logFoundFiles returns, the same value is then passed on to groupByModification. The output of groupByModification is fed to logGroupedFiles, whose return value is again ignored thanks to R.tap. The pipe therefore returns the output of the groupByModification call.
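
This flow can be traced with a small dependency-free sketch. Here pipe and tap are hand-written stand-ins for R.pipe and R.tap, groupByModification uses a plain reduce instead of _.groupBy, and the two log functions push into a seen array (a hypothetical helper for this example) so we can inspect what each one received.

```javascript
// simplified stand-ins for R.pipe and R.tap (one-argument functions only)
function pipe() {
  var fns = Array.prototype.slice.call(arguments);
  return function (x) {
    return fns.reduce(function (acc, fn) { return fn(acc); }, x);
  };
}
function tap(fn) {
  return function (x) { fn(x); return x; };
}

var seen = []; // records what each log step was given

function parseOutput(data) {
  return data.trim().split('\n')
    .filter(function (line) { return line.length; })
    .map(function (line) {
      var parts = line.split('\t');
      return { diff: parts[0], name: parts[1] };
    });
}
function groupByModification(files) {
  return files.reduce(function (groups, file) {
    (groups[file.diff] = groups[file.diff] || []).push(file);
    return groups;
  }, {});
}
function logFoundFiles(files) { seen.push(['found', files]); }
function logGroupedFiles(grouped) { seen.push(['grouped', grouped]); }

var outputToGroup = pipe(
  parseOutput,
  tap(logFoundFiles),
  groupByModification,
  tap(logGroupedFiles)
);

var result = outputToGroup('M\tsrc/foo.js\nA\tREADME.md\n');
console.log(Object.keys(result));   // [ 'M', 'A' ]
console.log(seen[0][1].length);     // 2, the parsed list reached logFoundFiles
console.log(seen[1][1] === result); // true, the logged group is the returned one
```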

Bonus - unit testing outputToGroup

We have refactored individual steps into separate functions, and composed the final logic into a function stored in a private variable var outputToGroup = .... Can we unit test the individual functions or the outputToGroup function? Usually one needs to export a function from a module to be able to unit test it. Too much trouble, we can do better. Using describe-it we can get access to pretty much anything inside a module without exporting it.

changed-files.js
function parseOutput(stdout) { ... }
function logFoundFiles(...) { ... }
...
var outputToGroup = R.pipe(
  parseOutput,
  R.tap(logFoundFiles),
  groupByModification,
  R.tap(logGroupedFiles)
);

spec/changed-files-spec.js
var describeIt = require('describe-it');
describeIt('../src/changed-files', 'function parseOutput(stdout)', function (getParseOutput) {
  it('works', function () {
    // get access to the private parseOutput function!
    var parseOutput = getParseOutput();
    var parsed = parseOutput('M\tfoo.js');
    // verify parsed
  });
});

Pretty cool - we got access to parseOutput in our unit tests without modifying any code inside the changed-files.js source file. To see the complete unit test, including tests that access the outputToGroup function, take a look at the changed-files-spec.js.

Additional reading