Joe Buschmann

let topics = [csharp; specflow; fun]

Text Parsing with F#

I’ve always had a fear of parsing text.  Strange since it’s a basic programmer’s task.  I’ve always found it to be tedious and boring, and the resulting code is a jumbled mess of indexes and string parsing methods.  I avoid it every chance I get.

And there’s regex.  The syntax is cryptic and difficult to remember.  It’s another thing I should know well, but, again, I avoid it.

Recently I decided to come to grips with this weakness and parse a text file containing information about TCP sockets in a TIME_WAIT status.  It was generated by a co-worker investigating network performance.  He wanted to get a list of IP addresses along with the amount of time they spent with a status of TIME_WAIT.  Then he could pull it into Excel to visualize the data.

I’ve been told that functional languages are great for parsing text, so I decided to use F#.  While doing some research, I came to a sudden realization.  The reason why it has always been difficult for me was I had been taking the wrong approach.  I was intermingling the parsing code with the business logic code.  My approach was along the lines of:  parse a line, pull some string values from previous lines from state, run some logic for the current line, store the string values in state, and move to the next line.  Included in the code were lots of index values into the text, helper methods, and conversions to data types like dates, integers, etc.  All that mixed in with the business logic.  No wonder I hated it.

The correct approach (or at least a better one) is to convert the textual representation of the data into data structures before running any business logic.  There should be a clear separation between the two.  With this step, the logic becomes easier.  There are no indexes or type conversions.  All that’s been done, and the business logic is focused and concise.

The Data Structures

Before I could continue, I needed to understand the data represented in the file from my co-worker and map them to data structures.  A portion of the file is below.

Time : 2011-11-09_08:36:19
Time : 2011-11-09_08:36:24
Time : 2011-11-09_08:36:29
Time : 2011-11-09_08:36:34
Time : 2011-11-09_08:36:40
Time : 2011-11-09_08:36:45
Time : 2011-11-09_08:36:50
Time : 2011-11-09_08:36:56
Time : 2011-11-09_08:37:01
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
Time :  2011-11-09_08:37:06
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0
TCP        TIME_WAIT       0

Each line was written in intervals of about 5 seconds.  There seemed to be two types of information recorded, a timestamp and an IP address of a socket with a status of TIME_WAIT, and each needed its own data structure.  The IP address lines were consistent in their format; however, the timestamps were not.  Sometimes there was a line break between the “Time :” label and the value and other times they were on the same line.  The table below breaks down each type of line format and the action to take.

Time : 2011-11-09_08:37:01 Create a data structure representation
Time : Ignore
2011-11-09_08:37:11 Create a data structure representation
TCP    TIME_WAIT    0 Create a data structure representation


To represent the data, I used a discriminated union with two discriminators that store the timestamp and IP address.  For the lines with the IP addresses, there are two addresses, and only the first one is useful because it is the one in a TIME_WAIT state.  The remaining information is superfluous and can be discarded.

type LineRecord =
| TimeRecord of DateTime
| TimeWaitRecord of int * int * int * int * int

Active Patterns

To parse each line into a LineRecord, I used an F# feature called active patterns which allowed me to set up pattern matching expressions for text values just as I would for a discriminated union.  I created four active pattern definitions.  The first two take a string value and attempt to return a date or an IP address.  The next two take a line of text from the file and return a LineRecord discriminator value.  Once these were in place, writing a function to convert the text file into a list of LineRecord values became trivial.

// Attempt to convert a string value into a date.
let (|Date|_|) (input:string) =
  let (success, date) = DateTime.TryParse(input)
  if (success) then Some(date)
  else None

// Attempt to convert a string value into an IP address
// represented by a tuple with four or five integers with
// the port being optional.
let (|IP|_|) (input:string) =
let delimChars = [|'.'; ':'|]
let arr = input.Split(delimChars)
match arr.Length with
| 4 -> Some((int(arr.[0]), int(arr.[1]), int(arr.[2]), int(arr.[3]), 0))
| 5 -> Some((int(arr.[0]), int(arr.[1]), int(arr.[2]), int(arr.[3]),
| _ -> None

// Attempt to convert a time record.
let (|TimeEntry||) (line:string) =
let line = line.Trim().Replace("Time : ", "").Replace('
', ' ')
match line with
| Date dateTime -> Some(TimeRecord(dateTime))
| _ -> None

// Attempt to convert a TIME_WAIT record.
let (|TimeWaitEntry|_|) (line:string) =
let line = line.Replace("TCP", "").Trim()
let index = line.IndexOf(' ')

match index with
| -1 -> None
| index ->
line = line.Substring(0, index)
match line with
| IP (ip) -> Some(TimeWaitRecord ip)
| _ -> None

// Parses lines from the file into a list of LineRecords.
let parseFile lines =
let lines = lines |> List.rev
let rec parseFileUtil lines acc =
match lines with
| hd :: tl ->
acc = match hd with
| TimeEntry timeRecord -> timeRecord :: acc
| TimeWaitEntry timeWaitRecord -> timeWaitRecord :: acc
| _ -> acc
parseFileUtil tl acc
| [] -> acc
parseFileUtil lines []

Running the Business Logic

Now that the contents of the file were represented neatly as a data structure, it was much easier to manipulate the data.  For example, I could get a list of times and the number of IP addresses in a TIME_WAIT state and save this list to a CSV file.

open System
open System.Collections.Generic
open System.IO
open DataStructures

let lines = File.ReadAllLines("out_time0.txt") |> Array.toList
let items = parseFile lines

// Find the times of each TIME_WAIT record and return a tuple
// with the time and IP address. Since the time was written
// to the file once for a set of addresses, it has to be
// passed into the util function as state. The accumulator
// parameter exists to enable tail call optimization.
let getTimeWaitTimes items =
let rec util items (currentTime:DateTime) acc =
match items with
| hd :: tl ->
currentTime = match hd with
| TimeRecord newTime -> newTime
| _ -> currentTime

    <span style="color: blue">let </span>acc = <span style="color: blue">match </span>hd <span style="color: blue">with
              </span>| TimeWaitRecord (val1, val2, val3, val4, val5) <span style="color: blue">-&gt;
                </span>(currentTime, (val1, val2, val3, val4, val5)) :: acc
              | _ <span style="color: blue">-&gt; </span>acc

    <span style="color: green">//printfn "Current Time %s" (currentTime.ToString())
    </span>util tl currentTime acc
  | [] <span style="color: blue">-&gt; </span>acc

util items DateTime.MinValue []

let timeWaits = getTimeWaitTimes items

// Create a map of the number of IP addresses in a
// TIME_WAIT state at a give time. The time is the key.
let counts = timeWaits |> List.fold (fun (state:Map<,>) (time, _) ->
(state.ContainsKey(time)) then
state |>
(fun k v ->
k = time then
v + 1
else v)
state.Add(time, 1)
) (Map.empty)

// Write the results to a CSV file.
let file = File.CreateText("results.csv")

counts |> Map.iter (fun k v -> file.WriteLine("{0},{1}", k, v) )



You can find the full source code including unit tests on GitHub.

This was a simple example, but it demonstrates the power of converting the textual data to an internal representation.  With the parsing done, the business logic comes together much easier.  Also, the active pattern language feature in F# makes the textual analysis simpler.

Now, if I could only master regex.