Parsing XML into a plain map in Golang


Using XML in 2024 may look a bit outdated, but it still happens. Sometimes the XML has a "free-form" structure, i.e. it cannot be parsed into a tree of user-defined structs. For JSON there is a way out: parse it with a single call into map[string]any and work with the result using careful type assertions. Regrettably, Go’s standard library has no similar feature for XML. Here I’ll draft a suitable function and demonstrate it, both for others and for myself in case I ever need it again (recreating it from scratch may be somewhat painful).
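For comparison, here is a minimal sketch of the JSON approach mentioned above. The helper names decodeJSON and personAge, as well as the sample document, are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSON parses the text of an arbitrary JSON object into a generic map.
func decodeJSON(data []byte) (map[string]any, error) {
	var m map[string]any
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	return m, nil
}

// personAge digs out person.age using checked type assertions and
// reports false if any step has an unexpected type.
func personAge(m map[string]any) (float64, bool) {
	person, ok := m["person"].(map[string]any)
	if !ok {
		return 0, false
	}
	age, ok := person["age"].(float64) // JSON numbers decode to float64
	return age, ok
}

func main() {
	m, err := decodeJSON([]byte(`{"person": {"age": 23, "name": "Zlodeec"}, "id": "AAA-013-B14560"}`))
	if err != nil {
		panic(err)
	}
	if age, ok := personAge(m); ok {
		fmt.Println("age:", age)
	}
}
```

The two-value form of the type assertion is what makes this "careful": a wrong type yields ok == false instead of a panic.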

Before continuing, note that other languages often provide easy approaches to free-form JSON and XML. You will find something in Java, for example, but scripting languages work best here (e.g. PHP with its json_decode and xml_parse), so consider picking a different tool if you are not already deep inside some large Go project.

Now to the matter. Go has the encoding/xml package, but its documentation is rather terse and it may be hard to make sense of.

The overview of the idea is like this:

  • define a top-level struct to parse the XML into and give it an UnmarshalXML method; the unmarshaller will then call this custom method to parse the struct

  • inside the method, iterate over the flat sequence of tags (opening and closing) and the data between them, calling decoder.Token() to get each next element

  • create the top-level map[string]any to hold the result, and also maintain a sequence of nested maps as a "path" leading to the current tag (a stack, really)

  • whenever an opening tag is found, push a new map onto this stack; whenever a closing tag is found, pop one off

Let’s have a look at the code and discuss it further.

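type rootElem struct {
    root map[string]any
}

func (x *rootElem) UnmarshalXML(decoder *xml.Decoder, start xml.StartElement) error {
    x.root = map[string]any{"_": start.Name.Local} // "_" holds the element's own tag name
    path := []map[string]any{x.root} // stack of maps from the root to the current element
    for {
        token, err := decoder.Token()
        if err != nil {
            if err == io.EOF { // no more tokens: parsing is done
                return nil
            }
            return err
        }
        switch elem := token.(type) {
        case xml.StartElement:
            newMap := map[string]any{"_": elem.Name.Local}
            path[len(path)-1][elem.Name.Local] = newMap // attach to parent; a same-name sibling is overwritten
            path = append(path, newMap) // descend into the new element
        case xml.EndElement:
            path = path[:len(path)-1] // pop back to the parent
        case xml.CharData:
            val := strings.TrimSpace(string(elem))
            if val == "" {
                break // whitespace between tags
            }
            if len(path) < 2 {
                // Text directly inside the root element has no parent map
                // to be stored in; skip it instead of indexing out of range.
                break
            }
            curName := path[len(path)-1]["_"].(string)
            path[len(path)-2][curName] = typeConvert(val) // the element becomes a leaf value
        }
    }
}

// typeConvert returns numeric text as float64 and anything else as string.
func typeConvert(s string) any {
    f, err := strconv.ParseFloat(s, 64)
    if err == nil {
        return f
    }
    return s
}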

What happens here? As noted, we need a (throw-away) struct for the unmarshaller to work with: this is rootElem, and it has a method the unmarshaller will call. Inside, we initialize the root map and the stack of maps leading to the current element. Then comes a loop that parses the XML token by token. The code uses the standard packages encoding/xml, io, strconv, and strings, so import them.

When an opening tag is encountered (StartElement), we create a new map and push it onto the stack, also attaching it under the corresponding name as a field of the map one level up.

When a closing tag is encountered (EndElement), we simply pop the stack. Self-closing tags (those without content, e.g. <someFlag/>) are handled properly, because the decoder reports them as a pair of start and end tags.
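To see the token stream for yourself, a small sketch like the following may help. It only lists the tokens, and the helper name tokenKinds and the sample XML are mine.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenKinds lists the tokens the decoder produces for the given XML,
// one short description per start tag, end tag, or piece of text.
func tokenKinds(src string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	var out []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			out = append(out, "start "+t.Name.Local)
		case xml.EndElement:
			out = append(out, "end "+t.Name.Local)
		case xml.CharData:
			out = append(out, fmt.Sprintf("text %q", string(t)))
		}
	}
}

func main() {
	kinds, err := tokenKinds(`<main><someFlag/><id>A1</id></main>`)
	if err != nil {
		panic(err)
	}
	for _, k := range kinds {
		fmt.Println(k)
	}
}
```

For this input the list contains "start someFlag" immediately followed by "end someFlag", which is why the map-building code needs no special case for self-closing tags.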

In the CharData branch we capture the text content of a tag. Whitespace-only sequences are skipped (they occur, as newlines and indents, between two opening tags for example). If the content is not empty, we store it in the map one level below the top of the stack, replacing the map created for the current tag: the "current tag" is thus treated as a "leaf" with a simple value rather than as a single-element map.

The typeConvert function is something you may customize. As an example, it tries to return numbers as float64 rather than as strings (exactly as Go’s JSON unmarshaller does when decoding into any).
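As one possible customization, here is a sketch of a stricter converter that tells integers, floats, and booleans apart. The name typeConvertStrict and the chosen rules are my assumptions, not part of the code above.

```go
package main

import (
	"fmt"
	"strconv"
)

// typeConvertStrict is a hypothetical variant of typeConvert: it tries
// int64, then float64, then bool, and falls back to the original string.
func typeConvertStrict(s string) any {
	if i, err := strconv.ParseInt(s, 10, 64); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	if s == "true" || s == "false" {
		return s == "true"
	}
	return s
}

func main() {
	for _, s := range []string{"23", "1.5", "true", "AAA-013-B14560"} {
		v := typeConvertStrict(s)
		fmt.Printf("%q -> %T\n", s, v)
	}
}
```

Keep in mind that any automatic conversion is a guess: strconv.ParseFloat also accepts inputs such as "NaN" and "Inf", and numeric-looking identifiers lose their leading zeros (007 becomes 7). If that matters for your data, returning plain strings may be the safer default.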

For convenience, every map gets an extra field keyed with an underscore, holding the name under which that map is stored in its parent. The root element’s name is thus preserved in the top-level map as well: root["_"] gives the root element’s tag.

To call this conversion you need a few more lines:

x := rootElem{}
if err := xml.Unmarshal(data, &x); err != nil {
    panic(err)
}
fmt.Printf("%v\n", x.root)

Here data is a byte slice with your XML content. See the example below.

Limitations

The code above doesn’t specifically deal with same-name tags under the same parent: a later element silently overwrites an earlier one. You may want to put them into an array or to signal an error; make the corresponding changes at line 19 (the assignment path[len(path)-1][elem.Name.Local] = newMap). Regrettably, there is no fully consistent way to handle such cases: until we meet a second element with the same name, we don’t know whether to create an array or not. And of course there is no simple way to tell types apart in some situations, e.g. whether the text 013 is a number or an identifier.
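One way to collect same-name tags into an array is sketched below. The helper addChild is hypothetical and shown in isolation. To use it in the parser you would call it instead of the plain assignment, and the CharData branch would also need adjusting so that it replaces the last element of the slice rather than the whole entry.

```go
package main

import "fmt"

// addChild is a hypothetical helper: it stores value under name in parent,
// promoting the entry to a []any once the name is already taken.
func addChild(parent map[string]any, name string, value any) {
	existing, found := parent[name]
	if !found {
		parent[name] = value // first occurrence: store as is
		return
	}
	if list, ok := existing.([]any); ok {
		parent[name] = append(list, value) // third and later occurrences
		return
	}
	parent[name] = []any{existing, value} // second occurrence: start a slice
}

func main() {
	parent := map[string]any{"_": "main"}
	addChild(parent, "id", "A1")
	addChild(parent, "id", "B2")
	addChild(parent, "id", "C3")
	addChild(parent, "name", "Zlodeec")
	fmt.Println(parent["id"], parent["name"])
}
```

The downside mentioned above stays: a tag that happens to occur once ends up as a plain value, while the same tag occurring twice becomes a slice, so the consumer has to check for both shapes.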

Example

Let us try converting the following XML:

<main>
  <person>
    <age>23</age>
    <name>Zlodeec</name>
  </person>
  <id>AAA-013-B14560</id>
</main>

The result (you may try this sandbox code: https://go.dev/play/p/Oeby-4_ymJM) is as follows:

map[_:main id:AAA-013-B14560 person:map[_:person age:23 name:Zlodeec]]
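Reading values back out of such a map takes a type assertion at every level. A small path-walking helper can hide that, roughly like this sketch. The name lookup is mine, and the map literal simply mirrors the result printed above.

```go
package main

import "fmt"

// lookup walks nested maps key by key and returns the value found at the
// end of the path, or false if a key is missing or a step is not a map.
func lookup(m map[string]any, keys ...string) (any, bool) {
	var cur any = m
	for _, k := range keys {
		node, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = node[k]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

func main() {
	// Same shape as the map printed above.
	root := map[string]any{
		"_":  "main",
		"id": "AAA-013-B14560",
		"person": map[string]any{
			"_":    "person",
			"age":  23.0,
			"name": "Zlodeec",
		},
	}
	if age, ok := lookup(root, "person", "age"); ok {
		fmt.Println("age:", age)
	}
	if _, ok := lookup(root, "person", "email"); !ok {
		fmt.Println("no email")
	}
}
```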

As this is a plain map, fully compatible with Go’s default JSON marshalling, we can convert it to JSON right away:

{
  "_": "main",
  "id": "AAA-013-B14560",
  "person": {
    "_": "person",
    "age": 23,
    "name": "Zlodeec"
  }
}

That’s it. The only thing to add is that the original idea of processing XML as a sequence of tokens was found elsewhere; I enriched it with the map-building logic and so on. It would be cool if you find this useful.


Link to the original article: https://habr.com/ru/articles/847854/

