1

I've inherited a legacy DB in which excel data is stored in a text column of a table in a postgres DB. A value from that column might look like:

<Sheets> <Sheet1> <Addresses E54="3" G23="1.1" N87="0"/> </Sheet1> <Sheet2> <Addresses W32="thing"/> </Sheet2> </Sheets> 

I know I can pick out values of specific adresses with

select xpath( '//Addresses/@E54', cast(ssd.data as xml)) from spreadsheetdata ssd 

but I have no idea how many distinct addresses exist.

What I'm hoping to do is produce a table looking like:

sheet address value Sheet1 E54 "3" Sheet1 G23 "1.1" Sheet1 N87 "0" Sheet2 W32 "thing" ... 

How do I do that?

1
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer.
    – CommunityBot
    CommentedAug 6, 2024 at 8:14

1 Answer 1

2

My solution was to first convert the xml to json; xml_to_json

Then I defined:

-- converts xml from data in spreadsheet templates table into a table via json CREATE OR REPLACE FUNCTION public.xml_to_table(xml) RETURNS TABLE(sheetname text, attributename text, attributevalue text) LANGUAGE 'sql' COST 100 VOLATILE PARALLEL UNSAFE ROWS 1000 AS $BODY$ -- from the records returned by the subquery below this returns records with columns SheetName, -- attribute name (address) and attribute value e.g.: -- Sheet1 E54 3 -- Sheet1 G23 1.1 -- Sheet1 N87 0 -- Sheet2 W32 thing -- ... select e.sheetname, jsonb_object_keys(e.attr) as attributename, e.attr ->> jsonb_object_keys(e.attr) as attributevalue from ( -- removes the rows with null for the list of attributes from the results from the subquery under this, and -- separates each attribute to its own row e.g.: -- Sheet1 {"E54": "3"} -- Sheet1 {"G23": "1.1"} -- Sheet1 {"N87": "0"} -- Sheet2 {"W32": "thing"} -- ... select d.sheetname, jsonb_array_elements(d.exceldata) as attr from ( -- separates each line from the subquery under this into records containging columns for sheetname, and -- this can handle xml having more than one element at the addresses level -- (e.g. it can handle a NamedeCells element alongside Addresses ) -- Sheet1 [{"E54": "3"}, ... -- Sheet2 [{"W32": "thing"}] -- ... select b.sheetname, b.records -> jsonb_object_keys(b.records) -> 'attr' as exceldata from ( -- separates each line from the subquery under this into records with columns for sheetname and -- a row for the JSON for each of Addresses e.g.: -- Sheet1 {Addresses: {attr: ... -- Sheet2 {Addresses: {attr: ... -- ... select jsonb_object_keys(a.sheetjson) as sheetname, jsonb_array_elements((a.sheetjson->jsonb_object_keys(a.sheetjson) -> 'childs')) as records from ( -- separates the supplied xml into json records for each sheet e.g.: -- {Sheet1: {attr: ... -- {Sheet2: {attr: ... -- ... select jsonb_array_elements(xml_to_json($1)->'Sheets'->'childs') as sheetjson ) as a ) as b ) as d ) as e; $BODY$; 

this can be called with

select * from xml_to_table('<Sheets> <Sheet1> <Addresses E54="3" G23="1.1" N87="0"/> </Sheet1> <Sheet2> <Addresses W32="thing"/> </Sheet2> </Sheets>') 

to produce

sheetnameattributenameattributevalue
Sheet1E543
Sheet1G231.1
Sheet1N870
Sheet2W32thing

This function is not a general as I'd like but it will suffice for my data clean up needs

Any comments welcome.

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.