Learning to be Giant.

Use RethinkDB to extract distinct tags with Map-reduce

|

This is originally a question I asked on StackOverflow, but I came up with the answer by myself.

I have a database whose documents are of the format:

    {
        'timestamp': 1431307846643,
        'tag': 'tag1',
        'message': 'message content'
    }

What I want to do is to:

  1. group all the documents by message
  2. find the earliest and latest timestamp from each group
  3. find all the distinct tags in a group

I would like the result to be something like:

    {
        'message': 'message content shared by the group',
        'earliest': 1431307840000,
        'latest': 1431307849999,
        'tags': ['tag1', 'tag2', 'tag3']
    }

I tried the following code in data explorer:

    r.db('DB').table('TB').group('message').map(function(doc){
      return {
        count: 1,
        from: doc('timestamp'),
        to: doc('timestamp'),
        tags: <PLACEHOLDER_1>
      }
    }).reduce(function(left, right){
      return {
        count: left('count').add(right('count')),
        from: r.branch(left('from').lt(right('from')), left('from'), right('from')),
        to: r.branch(left('to').gt(right('to')), left('to'), right('to')),
        tags: <PLACEHOLDER_2>
      }
    })

Problems are:

  1. If I fill <PLACEHOLDER_1> with doc('tag') and <PLACEHOLDER_2> with left('tags').setInsert(right('tags')), I get RqlRuntimeError: Expected type ARRAY but found STRING in: ....
  2. So I changed <PLACEHOLDER_1> to r([]).append(doc('tag')). Because I do not know how to access an array element in ReQL, I left <PLACEHOLDER_2> unchanged. This results in something like: js { "group": "Message Content", "reduction": { "tags": [ "Mobile Safari", [ "Mobile Safari" ], [ "Mobile Safari", [ "Mobile Safari" ] ] ], "count": 21, "earliest": 1431178639364, "latest": 1431584298495 } },

  3. I think this is because I setInserted an array to another array (since I appended the strings to an empty array in map at <PLACEHOLDER_1>). Therefore:

    js r.db('DB').table('TB').group('message').map(function(doc){ return { count: 1, from: doc('timestamp'), to: doc('timestamp'), tag: doc('tag') } }).reduce(function(left, right){ return { count: left('count').add(right('count')), earliest: r.branch(left('from').lt(right('from')), left('from'), right('from')), latest: r.branch(left('to').gt(right('to')), left('to'), right('to')), tag: r.branch(left('tag').typeOf().eq('STRING'), r([]).append(left('tag')).setInsert(right('tag')), left('tag').setInsert(right('tag'))) } })

    But still:

    js { "message": "ReferenceError: Can't find variable: $", "reduction": { "tags": [ "bingbot", "IE", "Mobile Safari", [ "bingbot", "Mobile Safari" ], [ "bingbot", "Firefox" ], [ "bingbot", "Mobile Safari", "IE", "Firefox" ], [ "bingbot", "Mobile Safari", "IE" ], [ "bingbot", "IE", "Mobile Safari" ] ], "count": 3129, "earliest": 1431167614434, "latest": 1431592586637 } },

I think the problem is due to the “common mistake” mentioned in the documentation that the reduce does not execute from left to right.

A common mistaken when using the reduce command is to suppose that the reduction is executed from left to right. - RethinkDB Documentation

Solution:

I solved this with:

    r.db('DB').table('TB').group('message').map(function(doc){
      return {
        count: 1,
        from: doc('timestamp'),
        to: doc('timestamp'),
        tags: r([]).append(doc('tag'))
      }
    }).reduce(function(left, right){
      return {
        count: left('count').add(right('count')),
        from: r.branch(left('from').lt(right('from')), left('from'), right('from')),
        to: r.branch(left('to').gt(right('to')), left('to'), right('to')),
        tags: left('tags').setUnion(right('tags'))
      }
    })

Just change setInsert to setUnion

Comments