Optimize and accelerate bash script to transform XML to SQLITE

Question

I have this script, which reads an XML file, converts it to CSV, and at the end loads the result into SQLite:

    #!/bin/bash
    rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2
    cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c27-30 > trdata
    cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $3 }'| cut -c1-19 > ttstamp
    a=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g'| awk '{ print $15 }'`
    b=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $12 }' | sed -e 's/;/ /g' | sed -e 's/=/ /g' | awk '{print $4 }'`
    c=`cat $1 | grep RecordStart | head -1 | sed -e 's/"/ /g' | awk '{print $9,$10 }'`
    touch rsctype rshost rscname
    kk=`wc -l trdata | awk '{ print $1 }'`
    for i in `seq 1 $kk`
    do
        echo $a >> rsctype
        echo $b >> rshost
        echo $c >> rscname
    done
    cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $5 }' > tservice
    cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $7 }' > tformat
    cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c41-44 > trdata2
    cat $1 | grep Telegram | sed -e 's/"/ /g' | awk '{ print $9 }' | cut -c31-34 > grhost
    awk -Wposix '{printf("%d\n","0x" $1)}' trdata > trdata3
    awk -Wposix '{printf("%d\n","0x" $1)}' trdata2 > trdata4
    sed -i "s%^0%0/%g" grhost
    cat grhost | cut -c1-3 > grhost2
    sed -i "s%.\{4\}%/%g" grhost
    pr -mts, grhost2 grhost > grhostfinal
    sed -i "s/,//g" grhostfinal
    cat grhostfinal | cut -c1-4 > grhostfinal1
    cat grhostfinal | cut -c5 > grhostfinal2
    awk -Wposix '{printf("%d\n","0x" $1)}' grhostfinal2 > grhostfinal3
    pr -mts, grhostfinal1 grhostfinal3 > grhostfinal4
    sed -i "s/,//g" grhostfinal4
    pr -mts, rshost ttstamp rsctype tservice tformat trdata4 trdata3 rscname grhostfinal4 > conjunto.csv
    sed -i "s|^|,|g" conjunto.csv
    sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
    cat data.csv | sort | uniq > data2.csv
    for k in `cat data2.csv`
    do
        grep "$k" conjunto.csv >> quitar
    done
    diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv
    echo `sqlite3 test2.sqlite < testxml`
    python csv2sqlite.py diferencia.csv test2.sqlite testxml4
    rm -f -r rshost rscname rsctype ttstamp tservice tformat trdata trdata2 trdata3 trdata4 grhost2 grhost grhostfinal3 grhostfinal1 grhostfinal2 grhostfinal grhostfinal4 a b c data.csv conjunto.csv data2.csv quitar

I have this XML (the data is private):

    <CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
      <RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
      <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
      <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
      <RecordStop Timestamp="" />
      <RecordStart Timestamp="" Mode="" Host="" ConnectionName="" ConnectionOptions="" ConnectorType="" MediumType="" />
      <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
      <Telegram Timestamp="" Service="" FrameFormat="" RawData="" />
      <RecordStop Timestamp="" />
    </CommunicationLog>

Once the data has been parsed, I write it to a CSV file and load that file into the database with the Python program csv2sqlite.py:

python csv2sqlite.py CSVFILE.csv DB.sqlite TABLESQLITE

My question is how can I make this script faster and more efficient, since it takes a long time to analyze all the data.

Can you provide the database schema and some dummy data for the XML? Like the CREATE TABLE statement you used and the format of the attributes. An explanation of what each file should contain would also be nice, as currently I'm getting RawData= for tservice and Service= for ttstamp, which is super confusing. - Gao, commented Nov 21, 2017 at 19:15

Gao · Accepted Answer · 2017-11-21 19:23:50Z

Take-home message: pipelines in Bash are slow (each stage is a separate process, so a redirection or a single command that does the same work is cheaper); loops in Bash are slow; Bash itself is slow.

Without a setup script and dummy data to test run it, I don't really understand exactly what your script is trying to achieve at each step, so I can only suggest the following improvements:

Avoid unnecessary pipelines

A lot of your code does cat $1 | grep Telegram | sed -e 's/"/ /g', which can be simplified to sed '/Telegram/!d; s/"/ /g' "$1", so you may want to save the result somewhere and extract from it when needed.
awk '{ print $9 }' | cut -c27-30 can be combined into awk '{print substr($9, 27, 4)}'.
In the command substitution that gets assigned to b, you have sed -e 's/;/ /g' | sed -e 's/=/ /g' which should really just be sed -e 's/;/ /g' -e 's/=/ /g' or even better just sed 's/[;=]/ /g'. You don't need the -e option if you're not combining expressions.
kk=`wc -l trdata | awk '{ print $1 }'`: kk=`wc -l < trdata` can do the job just fine.
cat grhostfinal | cut -c1-4 > grhostfinal1 is not as efficient as cut -c1-4 < grhostfinal > grhostfinal1
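
The points above can be combined. The following is a minimal sketch, not a drop-in replacement for the script: it filters the Telegram lines once and then writes several column files from a single awk pass. The sample XML values and the `workdir` scratch directory are invented for illustration; the field numbers and character ranges are the ones the original script uses.

```bash
#!/bin/bash
# Sketch: filter the Telegram lines once, then extract every column
# from the cached result instead of re-reading the XML for each one.
# All attribute values below are made-up sample data.
workdir=$(mktemp -d)

cat > "$workdir/sample.xml" <<'EOF'
<CommunicationLog xmlns="http://knx.org/xml/telegrams/01">
<RecordStart Timestamp="2017-11-21T19:00:00.000Z" Mode="ModeA" Host="HOST1" />
<Telegram Timestamp="2017-11-21T19:00:01.123Z" Service="SvcA" FrameFormat="FmtA" RawData="0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF" />
<Telegram Timestamp="2017-11-21T19:00:02.456Z" Service="SvcB" FrameFormat="FmtB" RawData="FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210" />
<RecordStop Timestamp="2017-11-21T19:00:03.000Z" />
</CommunicationLog>
EOF

# One sed call replaces cat | grep | sed; the result is kept in a file.
sed '/Telegram/!d; s/"/ /g' "$workdir/sample.xml" > "$workdir/telegrams"

# One awk call writes all column files while reading the input once.
awk -v dir="$workdir" '{
    print substr($3, 1, 19) > (dir "/ttstamp")
    print $5                > (dir "/tservice")
    print $7                > (dir "/tformat")
    print substr($9, 27, 4) > (dir "/trdata")
}' "$workdir/telegrams"
```

With this layout the XML file is read once by sed and the filtered lines once by awk, instead of once per output file.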

    sqlite3 test2.sqlite "select fecha from testxml4;" > data.csv
    cat data.csv | sort | uniq > data2.csv

is probably best done entirely in SQL:

    sqlite3 test2.sqlite "SELECT DISTINCT fecha FROM testxml4 ORDER BY fecha;" > data2.csv

Avoid loops if possible, or optimize for each iteration of the loop

```
for i in `seq 1 $kk`
do
    echo $a >> rsctype
done
```
is much slower than
```
printf "$a\n%.0s" `seq 1 $kk` >> rsctype 
```
for small $kk, and much slower than
```
yes "$a" | head -n "$kk" >> rsctype 
```
for large $kk. See https://superuser.com/questions/86340/linux-command-to-repeat-a-string-n-times. (Not sure if you actually want to repeat the same strings, but that's what your code does.)
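
To make the comparison concrete, here is a small sketch that writes the same value `kk` times in three ways and checks that the results match. The values of `a` and `kk` and the `workdir` directory are placeholders for this example. The second variant is an intermediate option not mentioned above: it keeps the loop but applies the redirection once to the whole loop, so the output file is opened a single time.

```bash
#!/bin/bash
# Sketch: three ways to write the same value kk times.
# a and kk are placeholder values chosen for this example.
a='USB'
kk=1000
workdir=$(mktemp -d)

# 1. Original pattern: the output file is reopened on every iteration.
for i in $(seq 1 "$kk"); do
    echo "$a" >> "$workdir/per_iteration"
done

# 2. Same loop, but the redirection is applied once to the whole loop.
for i in $(seq 1 "$kk"); do
    echo "$a"
done > "$workdir/redirect_once"

# 3. No shell loop: yes repeats the line, head stops it after kk lines.
yes "$a" | head -n "$kk" > "$workdir/no_loop"

# All three files should be byte-for-byte identical.
cmp "$workdir/per_iteration" "$workdir/redirect_once" &&
cmp "$workdir/per_iteration" "$workdir/no_loop" &&
echo "all three outputs are identical"
```

Wrapping each variant in `time` on realistic values of `kk` is the simplest way to see which one pays off for a given input.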

    for k in `cat data2.csv`
    do
        grep "$k" conjunto.csv >> quitar
    done
    diff quitar conjunto.csv | grep ">" | sed 's/^> //g' > diferencia.csv

looks like it could be done with just one diff? Maybe (haven't tested it):

    diff --changed-group-format='%>' --unchanged-group-format='' data2.csv conjunto.csv > diferencia.csv
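
Another option, which is not the diff approach above, is to let a single grep do the filtering. This sketch assumes that data2.csv holds one already-stored timestamp per line and that conjunto.csv holds the freshly parsed rows; the file contents and the `workdir` directory are invented sample data.

```bash
#!/bin/bash
# Sketch: keep only the rows of conjunto.csv whose timestamp is not
# already in the database. File contents are invented sample data.
workdir=$(mktemp -d)

# Timestamps already stored (what the SELECT on fecha would return).
printf '%s\n' '2017-11-21T19:00:01' '2017-11-21T19:00:03' > "$workdir/data2.csv"

# Freshly parsed rows, in the script's leading-comma layout.
cat > "$workdir/conjunto.csv" <<'EOF'
,HOST1,2017-11-21T19:00:01,USB,SvcA,FmtA,10,20,conn,1/2/3
,HOST1,2017-11-21T19:00:02,USB,SvcB,FmtB,11,21,conn,1/2/4
,HOST1,2017-11-21T19:00:03,USB,SvcC,FmtC,12,22,conn,1/2/5
EOF

# -f reads one pattern per line, -F treats each as a fixed string,
# -v inverts the match so only rows with unseen timestamps remain.
grep -v -F -f "$workdir/data2.csv" "$workdir/conjunto.csv" > "$workdir/diferencia.csv"

cat "$workdir/diferencia.csv"
# prints: ,HOST1,2017-11-21T19:00:02,USB,SvcB,FmtB,11,21,conn,1/2/4
```

Two caveats apply: a blank line in data2.csv would match every row, and the patterns are matched anywhere in the row, not only in the timestamp column, so this relies on timestamps not appearing in other fields.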

Non-performance-related notes

You should wrap your cleanup command in a trap and put it at the beginning of the script so that it will always be executed unless the program is terminated by a SIGKILL:
```
trap 'rm -f \
    rs{host,cname,ctype} \
    ttstamp tservice tformat trdata{,2,3,4} \
    grhost{,2,final{,{1..4}}} \
    data{,2}.csv conjunto.csv \
    {a..c} quitar' \
    'EXIT'
```
Please don't rm -r if you're not deleting directories. This is a dangerous command if you're not careful. I used brace expansion to shorten the list of input files I need to type, but I'm not sure if you really need to create that many temporary files.
You don't need to touch the files. Redirection will create the files if they don't exist.
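
If the temporary files stay, one way to keep the cleanup list short is to put them all in a private scratch directory. The sketch below is an assumption about how that could look, not part of the original script: the directory names are placeholders, and the subshell with a simulated failure exists only so the effect of the trap can be observed within one file. In a real script the `trap` line would sit at the top level, right after the directory is created.

```bash
#!/bin/bash
# Sketch: keep all scratch files in one private directory and remove it
# from an EXIT trap. Directory and file names are placeholders.
parent=$(mktemp -d)

(
    workdir=$(mktemp -d "$parent/xml2sqlite.XXXXXX")
    # Registered before any work happens, so it also runs on early exits.
    trap 'rm -f "$workdir"/*; rmdir "$workdir"' EXIT

    echo 'ABCD' > "$workdir/trdata"
    echo '2017-11-21T19:00:01' > "$workdir/ttstamp"

    exit 1    # simulate a failure halfway through the script
)
status=$?

# The subshell failed, yet its scratch directory has been removed.
echo "exit status: $status"
ls -A "$parent"
```

Because the trap only removes plain files and then the empty directory, it does not need `rm -r`.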

I can probably offer more suggestions if I could try out the script. I'll add to this answer if I think of anything, but this should be enough for now.
