
I have a big file that contains JSON objects, one object per line.

File example

{"Name" :"%Hana-29-Mrs-Smith","job":"engineer"}
{"Name" :"%Mike-31-Mr-Larry","job":"marketing"}
{"Name" :"%Jhon-40-Mr-Doe","job":"engineer"}

Desired output:

{"Name" :"%Hana-29-Mr-Smith", "f_nams":"Hana", "age":29, "title":"Mrs", "l_name":"Smith","job":"engineer"}
{"Name" :"%Mike-29-Mr-Larry", "f_nams":"Mike", "age":31, "title":"Mr", "l_name":"Larry","job":"marketing"}
{"Name" :"%Jhon-29-Mr-Smith", "f_nams":"Jhon", "age":40, "title":"Mr", "l_name":"Doe","job":"engineer"}
mongotop
  • It is nice to see a well-worded, clear title describing the exact issue. This is a useful question, because JSON files are not always just one JSON object. +1 for usefulness and clarity – Sergiy Kolodyazhnyy Jul 25 '20 at 00:32
  • Coming from you, that means a lot. Thank you @SergiyKolodyazhnyy! – mongotop Jul 25 '20 at 20:05
  • You brought up a great subject @SergiyKolodyazhnyy! When receiving small files, it might be useful to append them together until they reach a certain size (depending on your hardware) and then run one shell command on the result; you might see some improvements, depending on the scenario at hand! – mongotop Jul 25 '20 at 20:08
  • Glad I could help! :) – Sergiy Kolodyazhnyy Jul 25 '20 at 21:38

2 Answers


For non-nested objects such as this, you could consider using Miller:

$ mlr --json put -S '
    @x = splitnv(substr($Name,1,-1),"-"); $f_nams = @x[1]; $age = @x[2]; $title = @x[3]; $l_name = @x[4]
  ' then reorder -e -f job file.json
{ "Name": "%Hana-29-Mrs-Smith", "f_nams": "Hana", "age": 29, "title": "Mrs", "l_name": "Smith", "job": "engineer" }
{ "Name": "%Mike-31-Mr-Larry", "f_nams": "Mike", "age": 31, "title": "Mr", "l_name": "Larry", "job": "marketing" }
{ "Name": "%Jhon-40-Mr-Doe", "f_nams": "Jhon", "age": 40, "title": "Mr", "l_name": "Doe", "job": "engineer" }
steeldriver
  • What a tool! The speed and efficiency! Thank you! The only drawback I found was the regex capabilities: a regex statement that works fine in `regex101` does not work as a regex argument for `mlr` – mongotop Jul 26 '20 at 03:04

One possible way that is expressive, procedural and clear (although the script itself can seem a bit lengthy) is to use Python 3 with the json module.

#!/usr/bin/env python3
import json
import sys

with open(sys.argv[1]) as json_file:
    for line in json_file:
        # json.loads already returns a dict for a JSON object
        json_obj = json.loads(line)
        tokens = json_obj["Name"].split('-')
        extra_data = {
            "f_nams": tokens[0].replace('%', ''),
            "age"   : tokens[1],
            "title" : tokens[2],
            "l_name": tokens[3]
        }
        joined_data = {**json_obj, **extra_data}
        print(json.dumps(joined_data))

The way it works: we use open() as a context manager, so the file is closed automatically upon completion. From the sample data in the question we may assume that each JSON object is on a separate line (NOTE: if your actual data has multi-line JSON objects, you may have to adapt the script with a try-except block that keeps reading until a full JSON object has been accumulated).
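For illustration, one way such an adaptation could look (a hypothetical sketch, not part of the original answer; the `iter_json_objects` helper name is made up here) is to accumulate lines in a buffer and retry the parse:

```python
import io
import json

def iter_json_objects(fileobj):
    """Yield parsed JSON objects, accumulating lines until a full object parses."""
    buf = ""
    for line in fileobj:
        buf += line
        try:
            obj = json.loads(buf)
        except json.JSONDecodeError:
            continue  # the object continues on the next line
        buf = ""      # parsed successfully; start a fresh buffer
        yield obj

# Works for both one-object-per-line and multi-line objects:
sample = io.StringIO('{"Name" :"%Hana-29-Mrs-Smith",\n"job":"engineer"}\n')
for obj in iter_json_objects(sample):
    print(obj)  # {'Name': '%Hana-29-Mrs-Smith', 'job': 'engineer'}
```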

From there it's just text manipulation and Python magic: split the value of the "Name" key into tokens on the - character, put the tokens into a new dictionary, and join the two dictionaries with the ** operator, which is dictionary unpacking, introduced in Python 3.5 by PEP 448 (if you use another version of Python, check the link for alternatives). The result is converted back into a JSON object and printed on standard output. If you need to save it to a new file, use shell redirection, as in ./parse_data.py ./data.json > ./new_data.json, or, if you want to see it on screen at the same time, ./parse_data.py ./data.json | tee ./new_data.json
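As a minimal standalone illustration of that merge step, using data shaped like the question's:

```python
json_obj = {"Name": "%Hana-29-Mrs-Smith", "job": "engineer"}
extra_data = {"f_nams": "Hana", "age": "29", "title": "Mrs", "l_name": "Smith"}

# PEP 448 dictionary unpacking: builds a new dict from both;
# on duplicate keys, the right-hand dict wins.
joined_data = {**json_obj, **extra_data}
print(joined_data["f_nams"])  # Hana
```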

How it works in action:

$ ./parse_data.py ./data.json 
{"Name": "%Hana-29-Mrs-Smith", "job": "engineer", "f_nams": "Hana", "age": "29", "title": "Mrs", "l_name": "Smith"}
{"Name": "%Mike-31-Mr-Larry", "job": "marketing", "f_nams": "Mike", "age": "31", "title": "Mr", "l_name": "Larry"}
{"Name": "%Jhon-40-Mr-Doe", "job": "engineer", "f_nams": "Jhon", "age": "40", "title": "Mr", "l_name": "Doe"}

$ cat ./data.json 
{"Name" :"%Hana-29-Mrs-Smith","job":"engineer"}
{"Name" :"%Mike-31-Mr-Larry","job":"marketing"}
{"Name" :"%Jhon-40-Mr-Doe","job":"engineer"}
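One detail: the script above emits age as a string ("29"), while the desired output in the question has a bare number. If that matters, a possible tweak (my assumption, not part of the original answer) is to convert that token when building the dictionary:

```python
tokens = "%Hana-29-Mrs-Smith".split('-')
extra_data = {
    "f_nams": tokens[0].replace('%', ''),
    "age":    int(tokens[1]),  # numeric age, so json.dumps emits 29 rather than "29"
    "title":  tokens[2],
    "l_name": tokens[3],
}
print(extra_data)
```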
Sergiy Kolodyazhnyy
  • Thank you for the valuable example! I did it using Python. I was thinking about using a shell command because it's faster; we're talking about streaming a log file with 100k+ rows, and for that I was thinking shell would be faster than Python – mongotop Jul 25 '20 at 00:56
  • @mongotop Yes, if speed is a concern then command-line utilities written in C will be faster, of which `Miller`, presented in steeldriver's answer, should be appropriate. Of course there are ways to speed up Python, or even use Cython, but for the best performance C-based utilities are best – Sergiy Kolodyazhnyy Jul 25 '20 at 00:58