I have a PySpark DataFrame with a column containing XML strings, and I'm using XPath queries with absolute paths to extract data from these XML strings. However, I've noticed that the XPath queries return lists that omit values if they are not present, rather than including None in their place. I would like to keep the length of the lists consistent, filling in None where data is missing.
Here is the sample data and code I'm working with:
data = [
(1, """
Lion
Apple
Banana
Tiger
Cranberry
"""),
(2, """
Lion
Apple
Tiger
Banana
Zebra
""")
]
df = spark.createDataFrame(data, ["id", "xml_string"])
What the XPath queries return:
For the data and data2 columns, shown as (id, data, data2):
(1, ["Apple","Banana","Cranberry"], ["Lion","Tiger"])
(2, ["Apple","Banana"], ["Lion","Tiger","Zebra"])
What I want:
For the data and data2 columns, shown as (id, data, data2):
(1, ["Apple","Banana","Cranberry"], ["Lion", None, "Tiger"])
(2, ["Apple","Banana", None], ["Lion","Tiger","Zebra"])
How can I adjust my XPath queries? These are the two paths I'm currently using:
root/level1/level2/level3/level4/data
root/level1/level2/level3/data2
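One possible direction, offered as a rough sketch rather than a confirmed fix: an XPath 1.0 node-set contains only the nodes that matched, so a single path expression has no way to emit a placeholder for a missing element. A workaround is to iterate over the repeating parent element and look up each child separately, which yields None where the child is absent. The XML layout below is an assumption reconstructed from the two paths and the desired output (one `level3` block per list position), and `extract_aligned` is a hypothetical helper name, not part of Spark.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: hypothetical reconstruction of row 1, inferred from the
# two paths and the desired output. The real XML may be laid out differently.
SAMPLE_XML = """
<root><level1><level2>
  <level3><data2>Lion</data2><level4><data>Apple</data></level4></level3>
  <level3><level4><data>Banana</data></level4></level3>
  <level3><data2>Tiger</data2><level4><data>Cranberry</data></level4></level3>
</level2></level1></root>
"""

def extract_aligned(xml_string, parent_path, child_path):
    """Return one entry per parent element: the child's text, or None if the child is missing."""
    root = ET.fromstring(xml_string.strip())
    # findtext() returns None when child_path matches nothing under the parent,
    # so every parent contributes exactly one list entry.
    return [parent.findtext(child_path) for parent in root.findall(parent_path)]

# Paths are relative to the <root> element returned by ET.fromstring().
data = extract_aligned(SAMPLE_XML, "level1/level2/level3", "level4/data")
data2 = extract_aligned(SAMPLE_XML, "level1/level2/level3", "data2")
print(data)
print(data2)
```

If this matches the real structure, the helper could presumably be wrapped with `pyspark.sql.functions.udf` using an `ArrayType(StringType())` return type and applied to the `xml_string` column once per path. A Python UDF is slower than the built-in `xpath` function, so this trades speed for aligned lists.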
Thanks