Learning to be Giant.

Use RethinkDB to extract distinct tags with Map-reduce

|

This is originally a question I asked on StackOverflow, but I came up with the answer by myself.

I have a database whose documents are of the format:

    {
        'timestamp': 1431307846643,
        'tag': 'tag1',
        'message': 'message content'
    }

What I want to do is to:

  1. group all the documents by message
  2. find the earliest and latest timestamp from each group
  3. find all the distinct tags in a group

I would like the result to be something like:

    {
        'message': 'message content shared by the group',
        'earliest': 1431307840000,
        'latest': 1431307849999,
        'tags': ['tag1', 'tag2', 'tag3']
    }

I tried the following code in data explorer:

    r.db('DB').table('TB').group('message').map(function(doc){
      return {
        count: 1,
        from: doc('timestamp'),
        to: doc('timestamp'),
        tags: <PLACEHOLDER_1>
      }
    }).reduce(function(left, right){
      return {
        count: left('count').add(right('count')),
        from: r.branch(left('from').lt(right('from')), left('from'), right('from')),
        to: r.branch(left('to').gt(right('to')), left('to'), right('to')),
        tags: <PLACEHOLDER_2>
      }
    })

Problems are:

  1. If I fill <PLACEHOLDER_1> with doc('tag') and <PLACEHOLDER_2> with left('tags').setInsert(right('tags')), I get RqlRuntimeError: Expected type ARRAY but found STRING in: ....
  2. So I changed <PLACEHOLDER_1> to r([]).append(doc('tag')). Because I do not know how to access an array element in ReQL, I left <PLACEHOLDER_2> unchanged. This results in something like:
         {
             "group": "Message Content",
             "reduction": {
                 "tags": [
                     "Mobile Safari",
                     [
                         "Mobile Safari"
                     ],
                     [
                         "Mobile Safari",
                         [
                             "Mobile Safari"
                         ]
                     ]
                 ],
                 "count": 21,
                 "earliest": 1431178639364,
                 "latest": 1431584298495
             }
         },
    
  3. I think this is because I setInserted an array to another array (since I appended the strings to an empty array in map at <PLACEHOLDER_1>). Therefore:

         r.db('DB').table('TB').group('message').map(function(doc){
           return {
             count: 1,
             from: doc('timestamp'),
             to: doc('timestamp'),
             tag: doc('tag')
           }
         }).reduce(function(left, right){
           return {
             count: left('count').add(right('count')),
             earliest: r.branch(left('from').lt(right('from')), left('from'), right('from')),
             latest: r.branch(left('to').gt(right('to')), left('to'), right('to')),
             tag: r.branch(left('tag').typeOf().eq('STRING'), r([]).append(left('tag')).setInsert(right('tag')), left('tag').setInsert(right('tag')))
           }
         })
    

    But still:

         {
             "message": "ReferenceError: Can't find variable: $",
             "reduction": {
                 "tags": [
                     "bingbot",
                     "IE",
                     "Mobile Safari",
                     [
                         "bingbot",
                         "Mobile Safari"
                     ],
                     [
                         "bingbot",
                         "Firefox"
                     ],
                     [
                         "bingbot",
                         "Mobile Safari",
                         "IE",
                         "Firefox"
                     ],
                     [
                         "bingbot",
                         "Mobile Safari",
                         "IE"
                     ],
                     [
                         "bingbot",
                         "IE",
                         "Mobile Safari"
                     ]
                 ],
                 "count": 3129,
                 "earliest": 1431167614434,
                 "latest": 1431592586637
             }
         },
    

I think the problem is due to the “common mistake” mentioned in the documentation that the reduce does not execute from left to right.

A common mistaken when using the reduce command is to suppose that the reduction is executed from left to right. - RethinkDB Documentation

Solution:

I solved this with:

    r.db('DB').table('TB').group('message').map(function(doc){
      return {
        count: 1,
        from: doc('timestamp'),
        to: doc('timestamp'),
        tags: r([]).append(doc('tag'))
      }
    }).reduce(function(left, right){
      return {
        count: left('count').add(right('count')),
        from: r.branch(left('from').lt(right('from')), left('from'), right('from')),
        to: r.branch(left('to').gt(right('to')), left('to'), right('to')),
        tags: left('tags').setUnion(right('tags'))
      }
    })

Just change setInsert to setUnion

Disclaimer: This is a personal weblog. The opinions expressed here represent my own and not those of any entity with which I have been, am now, or will be affiliated.

Comments