20 September, 2024

PySpark XPath Query Returns Lists Omitting Missing Values Instead of Including None

I have a PySpark DataFrame with a column containing XML strings, and I'm using XPath queries with absolute paths to extract data from these XML strings. However, I've noticed that the XPath queries return lists that omit values if they are not present, rather than including None in their place. I would like to keep the length of the lists consistent, filling in None where data is missing.


Here is the sample data and code I'm working with:
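from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Element names and nesting follow the XPath expressions used in this question.
data = [
    (1, """<root>
    <level1>
        <level2>
            <level3>
                <data2>Lion</data2>
                <level4>
                    <data>Apple</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
                <level4>
                    <data>Banana</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
                <data2>Tiger</data2>
                <level4>
                    <data>Cranberry</data>
                </level4>
            </level3>
        </level2>
    </level1>
</root>"""),
    (2, """<root>
    <level1>
        <level2>
            <level3>
                <data2>Lion</data2>
                <level4>
                    <data>Apple</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
                <data2>Tiger</data2>
                <level4>
                    <data>Banana</data>
                </level4>
            </level3>
        </level2>
        <level2>
            <level3>
                <data2>Zebra</data2>
            </level3>
        </level2>
    </level1>
</root>"""),
]

df = spark.createDataFrame(data, ["id", "xml_string"])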



What the XPath queries return, shown as (id, data, data2):

(1, ["Apple","Banana","Cranberry"], ["Lion","Tiger"])
(2, ["Apple","Banana"], ["Lion","Tiger","Zebra"])

What I want, shown as (id, data, data2):

(1, ["Apple","Banana","Cranberry"], ["Lion", None, "Tiger"])
(2, ["Apple","Banana", None], ["Lion","Tiger","Zebra"])
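The shortening can be reproduced outside Spark. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` rather than Spark's XPath functions, on a small hypothetical document in which the second `level3` group has no `data2` element. A flat path query only returns nodes that exist, so nothing marks the position of the missing one:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-group document: the second <level3> has no <data2>.
xml_string = """
<root><level1>
  <level2><level3><data2>Lion</data2><level4><data>Apple</data></level4></level3></level2>
  <level2><level3><level4><data>Banana</data></level4></level3></level2>
</level1></root>
"""

root = ET.fromstring(xml_string)

# ElementTree paths are relative to the root element, so the leading "root/" is dropped.
data = [e.text for e in root.findall("level1/level2/level3/level4/data")]
data2 = [e.text for e in root.findall("level1/level2/level3/data2")]

print(data)   # two entries, one per existing <data> node
print(data2)  # one entry only; the missing <data2> leaves no placeholder
```

Each query is evaluated on its own, so the two lists carry no information about which group a value came from.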


How can I adjust my XPath queries to get this result? These are the two paths I am using:

root/level1/level2/level3/level4/data

root/level1/level2/level3/data2
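One possible direction, sketched here under assumptions rather than as a confirmed fix, is to stop querying the two leaf paths independently and instead visit each common parent (`level3`) once, reading both children relative to it. The sketch below uses Python's standard `xml.etree.ElementTree`, not Spark's XPath functions; the helper name `extract_aligned` and the compact sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def extract_aligned(xml_string: str) -> Tuple[List[Optional[str]], List[Optional[str]]]:
    """Return (data, data2) with one entry per <level3>, None where a child is absent."""
    root = ET.fromstring(xml_string)
    data: List[Optional[str]] = []
    data2: List[Optional[str]] = []
    # Paths are relative to the <root> element returned by fromstring().
    for level3 in root.findall("level1/level2/level3"):
        # findtext() returns None when the child element does not exist.
        data.append(level3.findtext("level4/data"))
        data2.append(level3.findtext("data2"))
    return data, data2


# Hypothetical sample: the second group has no <data2>.
sample = (
    "<root><level1>"
    "<level2><level3><data2>Lion</data2><level4><data>Apple</data></level4></level3></level2>"
    "<level2><level3><level4><data>Banana</data></level4></level3></level2>"
    "<level2><level3><data2>Tiger</data2><level4><data>Cranberry</data></level4></level3></level2>"
    "</level1></root>"
)

data, data2 = extract_aligned(sample)
print(data)
print(data2)
```

Because both lists are built in the same loop, they always have the same length. In PySpark, a function like this could presumably be wrapped with `pyspark.sql.functions.udf` and an array-of-strings return type and applied to the `xml_string` column; that wiring is an assumption and is not shown or tested here.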
