TL;DR
In the humanitarian response world there are tens of thousands of tabular (CSV and Excel) datasets, many of which contain critical information for helping save lives. Data can be provided by hundreds of different organizations with different naming conventions, languages and data standards, so having information (metadata) about what each column represents in tables is key to finding the right data and understanding how it fits together. Much of this metadata is set manually, which is time-consuming and error prone, so any automated method can have a real impact towards helping people. In this article we revisit a previous analysis "Predicting Metadata of Humanitarian Datasets with GPT 3" to see how advances in the last 18 months open the way for more efficient and less time-consuming methods for setting metadata on tabular data.
Using metadata-tagged CSV and Excel datasets from the Humanitarian Data Exchange (HDX), we show that fine-tuning GPT-4o-mini works well for predicting Humanitarian Exchange Language (HXL) tags and attributes for the most common tags related to location and dates. However, for less well-represented tags and attributes the technique can be limited due to poor quality training data, where humans have made mistakes in manually labelling data or simply aren't using all possible HXL metadata combinations. It also has the limitation of not being able to adjust when the metadata standard changes, since the training data would not reflect those changes.
Given more powerful LLMs are now available, we tested a technique to directly prompt GPT-4o or GPT-4o-mini rather than fine-tuning, providing the full HXL core schema definition in the system prompt now that larger context windows are available. This approach was shown to be more accurate than fine-tuning when using GPT-4o, able to support rarer HXL tags and attributes, and requiring no custom training data, making it easier to manage and deploy. It is however more expensive, but not if using GPT-4o-mini, albeit with a slight decrease in performance. Using this approach we provide a simple Python class in a GitHub Gist that can be used in data processing pipelines to automatically add HXL metadata tags and attributes to tabular datasets.
About 18 months ago I wrote a blog post Predicting Metadata of Humanitarian Datasets with GPT 3.
That's right, with GPT 3, not even 3.5! 🙂
Even so, back then Large Language Model (LLM) fine-tuning produced great performance for predicting Humanitarian Exchange Language (HXL) metadata fields for tabular datasets on the amazing Humanitarian Data Exchange (HDX). In that study, the training data represented the distribution of HXL data on HDX and so was comprised of the most common tags relating to location and dates. These are crucial for linking different datasets together in location and time, a key factor in using data to optimize humanitarian response.
The LLM field has since advanced … a LOT.
So in this article, we will revisit the technique, expand it to cover less frequent HXL tags and attributes, and explore other options now available to us for situations where a complex, high-cardinality taxonomy needs to be applied to data. We will also explore the ability to predict less frequent HXL standard tags and attributes not currently represented in the human-labeled training data.
You can follow along with this analysis by opening these notebooks in Google Colab or running them locally:
Please refer to the README in the repo for installation instructions.
For this study, and with help from the HDX team, we will use data extracted from the HDX platform using a crawler process they run to track the use of HXL metadata tags and attributes on the platform. You can find great HXL resources on GitHub, but if you want to follow along with this analysis I have also saved the source data on Google Drive, because the crawler takes days to process the hundreds of thousands of tabular datasets on HDX.
The data looks like this, with one row per HXL-tagged table column …
The HXL postcard is a really great overview of the most common HXL tags and attributes in the core schema. For our analysis, we will apply the full standard as found on HDX, which provides a spreadsheet of supported tags and attributes …
The generate-test-train-data.ipynb notebook provides all the steps taken to create the test and training datasets, but here are some key points to note:
1. Removal of automated-pipeline repeat HXL data
In this study, I removed duplicate data created by automated pipelines that upload data to HDX, using an MD5 hash of the column names in each tabular dataset (CSV and Excel files). For example, a CSV file of population statistics created by an organization is often very similar for each country-specific CSV or Excel file, so we only take one example. This has a balancing effect on the data, providing more variation of HXL tags and attributes by removing very similar repeat data.
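To make this concrete, here is a minimal sketch of this style of de-duplication. The helper names and toy tables are illustrative, not the repository's actual code; the idea is simply that files sharing an identical set of column names collapse to one example.

```python
import hashlib

import pandas as pd

def column_fingerprint(df):
    """Hash the column names of a table so structurally identical files share a key."""
    joined = "|".join(df.columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def deduplicate_tables(tables):
    """Keep only the first table seen for each column-name fingerprint."""
    seen, unique = set(), []
    for df in tables:
        key = column_fingerprint(df)
        if key not in seen:
            seen.add(key)
            unique.append(df)
    return unique

a = pd.DataFrame({"adm1": ["Kabul"], "population": [100]})
b = pd.DataFrame({"adm1": ["Herat"], "population": [200]})  # same columns, different country
c = pd.DataFrame({"org": ["OCHA"]})                         # different structure, kept
print(len(deduplicate_tables([a, b, c])))  # 2
```

Hashing only the column names (not the values) is what gives the balancing effect: per-country copies of the same pipeline output disappear, while structurally different tables survive.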
2. Constraining data to valid HXL
About 50% of the HDX data with HXL tags uses a tag or attribute which is not specified in the HXL Core Schema, so this data is removed from the training and test sets.
3. Data enrichment
As a (mostly!) human being, when deciding what HXL tags and attributes to use on a column, I take a peek at the data for that column and also the data as a whole in the table. For this analysis we do the same for the LLM fine-tuning and prompt data, adding in data excerpts for each column. A table description is also added using an LLM (GPT-3.5-Turbo) summary of the data to make them consistent, as descriptions on HDX can vary in form, ranging from pages to a few words.
4. Carefully splitting data to create train/test sets
Many machine learning pipelines split data randomly to create training and test sets. However, for HDX data this could result in columns and files from the same organization being in both train and test. I felt this was a bit too easy for testing predictions, and so instead split the data by organization to ensure organizations in the test set were not in the training data. Additionally, subsidiaries of the same parent organization — e.g. "ocha-iraq" and "ocha-libya" — were not allowed to be in both the training and test sets, again to make the predictions more realistic. My aim was to test prediction with organizations as if their data had never been seen before.
After all of the above and down-sampling to save costs, we are left with 2,883 rows in the training set and 485 rows in the test set.
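A grouped split like this can be sketched as follows. This is an illustrative implementation, not the notebook's code, and it assumes (for the sketch only) that the parent organization is the prefix before the first hyphen in the org identifier.

```python
import random

def split_by_organization(rows, test_fraction=0.2, seed=42):
    """Split rows so no parent organization appears in both train and test.

    Each row is a dict with an 'org' field such as 'ocha-iraq'; the parent
    is taken as the prefix before the first hyphen (a sketch assumption).
    """
    parents = sorted({r["org"].split("-")[0] for r in rows})
    random.Random(seed).shuffle(parents)
    n_test = max(1, int(len(parents) * test_fraction))
    test_parents = set(parents[:n_test])
    train = [r for r in rows if r["org"].split("-")[0] not in test_parents]
    test = [r for r in rows if r["org"].split("-")[0] in test_parents]
    return train, test

rows = [
    {"org": "ocha-iraq", "column": "adm1"},
    {"org": "ocha-libya", "column": "adm2"},
    {"org": "unhcr-syria", "column": "affected"},
    {"org": "wfp-chad", "column": "date"},
]
train, test = split_by_organization(rows)
# ocha-iraq and ocha-libya always land on the same side of the split
```

Splitting on the parent rather than the org itself is what prevents near-duplicate country offices of one organization from leaking between the two sets.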
In my original article I opted for using a completion model, but with the release of GPT-4o-mini I instead generated prompts appropriate for fine-tuning a chat model (see here for more details about the available models).
Each prompt has the form …
{
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": ""
},
{
"role": "assistant",
"content": ""
}
]
}
Note: the above has been formatted for readability, but JSONL will have everything on one line per record.
Using the data excerpts, the LLM-generated table description, and the column name we collated, we can now generate prompts which look like this …
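One way such chat records could be serialized to JSONL — one compact JSON object per line — is sketched below. The file name and record content here are made-up examples, not the article's actual training data.

```python
import json

# A hypothetical record in the chat fine-tuning format described above
records = [
    {"messages": [
        {"role": "system", "content": "You are an assistant that replies with HXL tags and attributes"},
        {"role": "user", "content": "What are the HXL tags and attributes for a column named 'admin1'?"},
        {"role": "assistant", "content": "#adm1+name"},
    ]}
]

with open("hxl_prompts_example.jsonl", "w") as f:
    for record in records:
        # json.dumps with default separators keeps each record on a single line
        f.write(json.dumps(record) + "\n")
```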
{
"messages": [
{
"role": "system",
"content": "You are an assistant that replies with HXL tags and attributes"
},
{
"role": "user",
"content": "What are the HXL tags and attributes for a column with these details?
resource_name='admin1-summaries-earthquake.csv';
dataset_description='The dataset contains earthquake data for various
administrative regions in Afghanistan,
including country name, admin1 name, latitude,
longitude, aggregation type, indicator name,
and indicator value. The data includes maximum
earthquake values recorded in different regions,
with corresponding latitude and longitude coordinates.
The dataset provides insights into the seismic
activity in different administrative areas of
Afghanistan.';
column_name:'indicator';
examples: ['earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake']"
},
    {
      "role": "assistant",
      "content": "#indicator+name"
    }
  ]
}
We now have test and training files in the right format for fine-tuning an OpenAI chat model, so let's tune our model …
def fine_tune_model(train_file, model_name="gpt-4o-mini"):
    """
    Fine-tune an OpenAI model using training data.

    Args:
        train_file (str): The file containing the prompts to use for fine-tuning.
        model_name (str): The name of the model to fine-tune. Default is "gpt-4o-mini".

    Returns:
        str: The ID of the fine-tuned model.
    """

    # Upload file to OpenAI for fine-tuning
    file = client.files.create(
        file=open(train_file, "rb"),
        purpose="fine-tune"
    )
    file_id = file.id
    print(f"Uploaded training file with ID: {file_id}")

    # Start the fine-tuning job
    ft = client.fine_tuning.jobs.create(
        training_file=file_id,
        model=model_name
    )
    ft_id = ft.id
    print(f"Fine-tuning job started with ID: {ft_id}")

    # Monitor the status of the fine-tuning job
    ft_result = client.fine_tuning.jobs.retrieve(ft_id)
    while ft_result.status != 'succeeded':
        print(f"Current status: {ft_result.status}")
        time.sleep(120)  # Wait for 120 seconds before checking again
        ft_result = client.fine_tuning.jobs.retrieve(ft_id)
        if 'failed' in ft_result.status.lower():
            sys.exit()
    print(f"Fine-tuning job {ft_id} succeeded!")

    # Retrieve the fine-tuned model
    fine_tuned_model = ft_result.fine_tuned_model
    print(f"Fine-tuned model: {fine_tuned_model}")

    return fine_tuned_model

model = fine_tune_model("hxl_chat_prompts_train.jsonl", model_name="gpt-4o-mini-2024-07-18")
In the above we are using the new GPT-4o-mini model, which OpenAI is currently offering for free fine-tuning …
"Now through September 23, GPT-4o mini is free to fine-tune up to a daily limit of 2M training tokens. Overages over 2M training tokens will be charged at $3.00/1M tokens. Starting September 24, fine-tuning training will cost $3.00/1M tokens. Check out the fine-tuning docs for more details on free access."
Even at $3.00/1 million tokens, the costs are quite low for this task, coming out at about $7 per fine-tuning run for just over 2 million tokens in the training file. Bearing in mind, fine-tuning should be a rare event for this particular task; once we have such a model it can be reused.
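The cost arithmetic above is simple enough to sketch. The token count here is an assumed round figure consistent with the "just over 2 million tokens" quoted, and the single-epoch assumption is mine: OpenAI may train multiple epochs, which multiplies the billed training tokens.

```python
# Back-of-the-envelope fine-tuning cost at the quoted $3.00 per 1M training tokens
PRICE_PER_MILLION = 3.00
training_tokens = 2_300_000  # assumed: "just over 2 million tokens"
epochs = 1                   # assumption; billed tokens scale with the number of epochs

cost = (training_tokens / 1_000_000) * PRICE_PER_MILLION * epochs
print(f"${cost:.2f}")  # $6.90, in line with the ~$7 per run quoted above
```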
The fine-tuning produces the following output …
Uploaded training file with ID: file-XXXXXXXXXXXXXXX
Fine-tuning job started with ID: ftjob-XXXXXXXXXXXXXXX
Current status: validating_files
Current status: validating_files
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Fine-tuning job ftjob-XXXXXXXXXXXXXXX succeeded!
Fine-tuned model: ft:gpt-4o-mini-2024-07-18::XXXXXXX
It took about 45 minutes.
Now that we have a shiny new fine-tuned model for predicting HXL tags and attributes, we can use the test file to take it for a spin …
def make_chat_predictions(prompts, model, temperature=0.1, max_tokens=13):
    """
    Generate chat predictions based on given prompts using the OpenAI chat model.

    Args:
        prompts (list): A list of prompts, where each prompt is a dictionary containing a list of messages.
                        Each message in the list has a 'role' (either 'system', 'user', or 'assistant') and 'content'.
        model (str): The name or ID of the OpenAI chat model to use for predictions.
        temperature (float, optional): Controls the randomness of the predictions. Higher values (e.g., 0.5) make the
                                       output more random, while lower values (e.g., 0.1) make it more deterministic.
                                       Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens in the predicted response. Defaults to 13.

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the chat predictions. Each row in the DataFrame
                          corresponds to a prompt and includes the prompt messages, the actual message, and the
                          predicted message.
    """
    results = []
    for p in prompts:
        actual = p["messages"][-1]["content"]
        p["messages"] = p["messages"][0:2]
        completion = client.chat.completions.create(
            model=model,
            messages=p["messages"],
            temperature=temperature,
            max_tokens=max_tokens
        )
        predicted = completion.choices[0].message.content
        predicted = filter_for_schema(predicted)
        res = {
            "prompt": p["messages"],
            "actual": actual,
            "predicted": predicted
        }
        print(f"Predicted: {predicted}; Actual: {actual}")
        results.append(res)
    results = pd.DataFrame(results)
    return results
def filter_for_schema(text):
    """
    Filters the input text to extract approved HXL schema tokens.

    Args:
        text (str): The input text to be filtered.

    Returns:
        str: The filtered text containing only approved HXL schema tokens.
    """
    if " " in text:
        text = text.replace(" ", "")
    tokens_raw = text.split("+")
    tokens = [tokens_raw[0]]
    for t in tokens_raw[1:]:
        tokens.append(f"+{t}")
    filtered = []
    for t in tokens:
        if t in APPROVED_HXL_SCHEMA:
            if t not in filtered:
                filtered.append(t)
    filtered = "".join(filtered)
    if len(filtered) > 0 and filtered[0] != '#':
        filtered = ""
    return filtered
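To illustrate what this post-filter does, here is a standalone sketch using a toy approved-token set (the real APPROVED_HXL_SCHEMA is built from the core schema spreadsheet, and the function name here is hypothetical):

```python
# Toy stand-in for the approved-token set built from the HXL core schema
APPROVED = {"#affected", "+infected", "+f"}

def filter_tokens(text):
    """Keep only approved tag/attribute tokens, preserving order and dropping duplicates."""
    text = text.replace(" ", "")
    parts = text.split("+")
    tokens = [parts[0]] + [f"+{p}" for p in parts[1:]]
    kept = []
    for t in tokens:
        if t in APPROVED and t not in kept:
            kept.append(t)
    result = "".join(kept)
    # Discard any result that doesn't start with a hashtag
    return result if result.startswith("#") else ""

print(filter_tokens("#affected +infected +madeup"))  # -> #affected+infected
print(filter_tokens("+infected"))                    # -> "" (no leading hashtag)
```

The effect is that hallucinated tokens are silently dropped rather than failing the whole prediction, and a prediction with no valid leading hashtag is rejected outright.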
def output_prediction_metrics(results, prediction_field="predicted", actual_field="actual"):
    """
    Prints out a model performance report for HXL tag prediction. Metrics are for
    just predicting tags, as well as predicting tags and attributes.

    Parameters
    ----------
    results : dataframe
        Dataframe of results
    prediction_field : str
        Field name of element with prediction. Useful for comparing raw and post-processed predictions.
    actual_field : str
        Field name of the actual result for comparison with the prediction
    """
    y_test = []
    y_pred = []
    y_justtag_test = []
    y_justtag_pred = []
    for index, r in results.iterrows():
        if actual_field not in r and prediction_field not in r:
            print("Provided results do not contain expected values.")
            sys.exit()
        y_pred.append(r[prediction_field])
        y_test.append(r[actual_field])
        actual_tag = r[actual_field].split("+")[0]
        predicted_tag = r[prediction_field].split("+")[0]
        y_justtag_test.append(actual_tag)
        y_justtag_pred.append(predicted_tag)

    print(f"LLM results for {prediction_field}, {len(results)} predictions ...")
    print("\nJust HXL tags ...\n")
    print(f"Accuracy: {round(accuracy_score(y_justtag_test, y_justtag_pred), 2)}")
    print(
        f"Precision: {round(precision_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"Recall: {round(recall_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"F1: {round(f1_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0), 2)}"
    )

    print(f"\nTags and attributes with {prediction_field} ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0), 2)}"
    )
    return
with open(TEST_FILE) as f:
    X_test = [json.loads(line) for line in f]

results = make_chat_predictions(X_test, model)
output_prediction_metrics(results)
print("Done")
Note in the above that all predictions are filtered for allowed tags and attributes as defined in the HXL standard.
This gives the following results …
LLM results for predicted, 458 predictions ...

Just HXL tags ...
Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82
Tags and attributes with predicted ...
Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57
'Just HXL tags' means predicting the first part of the HXL; for example, if the full HXL is #affected+infected+f, the model correctly got the #affected part. 'Tags and attributes' means predicting the full HXL string, i.e. '#affected+infected+f', a much harder challenge due to all the combinations possible.
The performance isn't perfect, but not that bad, especially as we have balanced the dataset to reduce the number of location and date tags and attributes (i.e. made this study a bit harder). Given there are tens of thousands of humanitarian response tables without HXL, even the above performance would likely add value.
Let's look into cases where predictions didn't agree with the human-labeled data …
The predictions were saved to a spreadsheet, and I manually went through many of the predictions that didn't agree with the labels. You can find this analysis here and summarized below …
What's interesting is that in some cases the LLM is actually correct, for example in adding extra HXL attributes which the human-labeled data doesn't include. There are also cases where the human-labeled HXL was perfectly reasonable, but the LLM predicted another tag or attribute that could also be interpreted as correct. For example a #region can also be an #adm1 in some countries, and whether something is an +id or a +code is often difficult to decide; both are appropriate.
Using the above categories, I created a new test set where the expected HXL tags were corrected. On re-running the prediction we get improved results …
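The tag-only metric simply compares everything before the first '+'. A minimal sketch of how the two comparisons differ (the example strings are illustrative, using sample HXL from the core schema):

```python
# Illustrative predictions: full HXL strings vs. human labels
predictions = [
    {"actual": "#targeted+f+adult", "predicted": "#targeted+adult"},  # tag right, attributes differ
    {"actual": "#adm1+name", "predicted": "#adm1+name"},              # exact match
]

def just_tag(hxl):
    """Everything before the first '+' is the hashtag; attributes follow."""
    return hxl.split("+")[0]

tag_matches = sum(just_tag(p["actual"]) == just_tag(p["predicted"]) for p in predictions)
full_matches = sum(p["actual"] == p["predicted"] for p in predictions)
print(tag_matches, full_matches)  # 2 1
```

This is why tag-only accuracy is always at least as high as full tag-and-attribute accuracy: every exact match is also a tag match, but not vice versa.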
Just HXL tags ...

Accuracy: 0.88
Precision: 0.88
Recall: 0.88
F1: 0.88
Tags and attributes with predicted ...
Accuracy: 0.66
Precision: 0.71
Recall: 0.66
F1: 0.66
The above shows that the human-labeled data itself can be incorrect. The HXL standard is excellently designed, but can be a challenge to memorize for developers and data scientists when setting HXL tags and attributes on data. There are some amazing tools already provided by the HXL team, but sometimes the HXL is still incorrect. This introduces a problem for the fine-tuning approach, which relies on this human-labeled data for training, especially for less well represented tags and attributes that humans may not be using very often. It also has the limitation of not being able to adjust when the metadata standard changes, since the training data would not reflect those changes.
Since the initial analysis 18 months ago, various LLM providers have advanced their models significantly. OpenAI of course released GPT-4o as their flagship product, which importantly has a context window of 128k tokens and is another data point suggesting costs of foundational models are decreasing (see for example GPT-4-Turbo compared to GPT-4o here). Given these factors, I wondered …
If models are becoming more powerful and cheaper to use, could we avoid fine-tuning altogether and use them to predict HXL tags and attributes by prompting alone?
Not only could this mean less engineering work to clean data and fine-tune models, it could have a big advantage in being able to include HXL tags and attributes which are not in the human-labeled training data but are part of the HXL standard. This is one potentially huge advantage of powerful LLMs: being able to classify with zero- and few-shot prompting.
Models like GPT-4o are trained on web data, so I thought I'd first do a test using one of our prompts to see if it already knew everything there was to know about HXL tags …
What we see is that it seems to know about HXL syntax, but the answer is incorrect (the correct answer is '#affected+infected'), and it has chosen tags and attributes that are not in the HXL standard. It's actually similar to what we see with human-tagged HXL.
How about we provide the most important parts of the HXL standard in the system prompt?
def generate_hxl_standard_prompt(local_data_file):
    """
    Generate a standard prompt for predicting Humanitarian Markup Language (HXL) tags and attributes.

    Args:
        local_data_file (str): The path to the local data file containing core hashtags and attributes.

    Returns:
        str: The generated HXL standard prompt.
    """
    core_hashtags = pd.read_excel(local_data_file, sheet_name='Core hashtags')
    core_hashtags = core_hashtags.loc[core_hashtags["Release status"] == "Released"]
    core_hashtags = core_hashtags[["Hashtag", "Hashtag long description", "Sample HXL"]]

    core_attributes = pd.read_excel(local_data_file, sheet_name='Core attributes')
    core_attributes = core_attributes.loc[core_attributes["Status"] == "Released"]
    core_attributes = core_attributes[["Attribute", "Attribute long description", "Suggested hashtags (selected)"]]

    print(core_hashtags.shape)
    print(core_attributes.shape)

    core_hashtags = core_hashtags.to_dict(orient='records')
    core_attributes = core_attributes.to_dict(orient='records')

    hxl_prompt = f"""
    You are an AI assistant that predicts Humanitarian Markup Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

    CORE HASHTAGS:

    {json.dumps(core_hashtags, indent=4)}

    CORE ATTRIBUTES:

    {json.dumps(core_attributes, indent=4)}

    Key points:

    - ALWAYS predict hashtags
    - NEVER predict a tag which is not a valid core hashtag
    - NEVER start with an attribute, you must always start with a core hashtag
    - Always try to predict an attribute if possible
    - Do not use attribute +code if the data examples are human-readable names

    You must return your result as a JSON record with the fields 'predicted' and 'reasoning', each of type string.
    """

    print(len(hxl_prompt.split(" ")))
    print(hxl_prompt)

    return hxl_prompt
This gives us a prompt like this …
You are an AI assistant that predicts Humanitarian Markup Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

CORE HASHTAGS:
[
{
"Hashtag": "#access",
"Hashtag long description": "Accessiblity and constraints on access to a market, distribution point, facility, etc.",
"Sample HXL": "#access +type"
},
{
"Hashtag": "#activity",
"Hashtag long description": "A programme, project, or other activity. This hashtag applies to all levels; use the attributes +activity, +project, or +programme to distinguish different hierarchical levels.",
"Sample HXL": "#activity +project"
},
{
"Hashtag": "#adm1",
"Hashtag long description": "Top-level subnational administrative area (e.g. a governorate in Syria).",
"Sample HXL": "#adm1 +code"
},
{
"Hashtag": "#adm2",
"Hashtag long description": "Second-level subnational administrative area (e.g. a subdivision in Bangladesh).",
"Sample HXL": "#adm2 +name"
},
{
"Hashtag": "#adm3",
"Hashtag long description": "Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).",
"Sample HXL": "#adm3 +code"
},
{
"Hashtag": "#adm4",
"Hashtag long description": "Fourth-level subnational administrative area (e.g. a barangay in the Philippines).",
"Sample HXL": "#adm4 +name"
},
{
"Hashtag": "#adm5",
"Hashtag long description": "Fifth-level subnational administrative area (e.g. a ward of a city).",
"Sample HXL": "#adm5 +code"
},
{
"Hashtag": "#affected",
"Hashtag long description": "Number of people or households affected by an emergency. Subset of #population; superset of #inneed.",
"Sample HXL": "#affected +f +children"
},
{
"Hashtag": "#beneficiary",
"Hashtag long description": "General (non-numeric) information about a person or group meant to benefit from aid activities, e.g. "lactating women".",
"Sample HXL": "#beneficiary +name"
},
{
"Hashtag": "#capacity",
"Hashtag long description": "The response capacity of the entity being described (e.g. "25 beds").",
"Sample HXL": "#capacity +num"
},
... Truncated for brevity
},
{
"Hashtag": "#targeted",
"Hashtag long description": "Number of people or households targeted for humanitarian assistance. Subset of #inneed; superset of #reached.",
"Sample HXL": "#targeted +f +adult"
},
{
"Hashtag": "#value",
"Hashtag long description": "A monetary value, such as the price of goods in a market, a project budget, or the amount of cash transferred to beneficiaries. May be used together with #currency in financial or cash data.",
"Sample HXL": "#value +transfer"
}
]
CORE ATTRIBUTES:
[
{
"Attribute": "+abducted",
"Attribute long description": "Hashtag refers to people who have been abducted.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached"
},
{
"Attribute": "+activity",
"Attribute long description": "The implementers classify this activity as an "activity" proper (may imply different hierarchical levels in different contexts).",
"Suggested hashtags (selected)": "#activity"
},
{
"Attribute": "+adolescents",
"Attribute long description": "Adolescents, loosely defined (precise age range varies); may overlap +children and +adult. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adolescents +age12_17".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+adults",
"Attribute long description": "Adults, loosely defined (precise age range varies); may overlap +adolescents and +elderly. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adults +age18_64".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+approved",
"Attribute long description": "Date or time when something was approved.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+bounds",
"Attribute long description": "Boundary data (e.g. inline GeoJSON).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+budget",
"Attribute long description": "Used with #value to indicate that the amount is planned/approved/budgeted rather than actually spent.",
"Suggested hashtags (selected)": "#value"
},
{
"Attribute": "+canceled",
"Attribute long description": "Date or time when something (e.g. an #activity) was canceled.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+children",
"Attribute long description": "The associated hashtag applies to non-adults, loosely defined (precise age range varies; may overlap +infants and +adolescents). You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+children +age3_11".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+cluster",
"Attribute long description": "Identifies a sector as a formal IASC humanitarian cluster.",
"Suggested hashtags (selected)": "#sector"
},
{
"Attribute": "+code",
"Attribute long description": "A unique, machine-readable code.",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc, #beneficiary, #activity, #org, #sector, #subsector, #indicator, #output, #crisis, #cause, #impact, #severity, #service, #need, #currency, #item, #need, #service, #channel, #modality, #event, #group, #status"
},
{
"Attribute": "+converted",
"Attribute long description": "Date or time used for converting a monetary value to another currency.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+coord",
"Attribute long description": "Geodetic coordinates (lat+lon together).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+dest",
"Attribute long description": "Place of destination (intended or actual).",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc"
},
{
"Attribute": "+displaced",
"Attribute long description": "Displaced people or households. Refers to all types of displacement: use +idps or +refugees to be more specific.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+elderly",
"Attribute long description": "Elderly people, loosely defined (precise age range varies). May overlap +adults. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+elderly +age65plus".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
... Truncated for brevity
{
"Attribute": "+url",
"Attribute long description": "The data consists of web links related to the main hashtag (e.g. for an #org, #service, #activity, #loc, etc).",
"Suggested hashtags (selected)": "#contact, #org, #activity, #service, #meta"
},
{
"Attribute": "+used",
"Attribute long description": "Refers to a #service, #item, etc. that affected people have actually consumed or otherwise taken advantage of.",
"Suggested hashtags (selected)": "#service, #item"
}
]
Key points:

- ALWAYS predict hashtags
- NEVER predict a tag which is not a valid core hashtag
- NEVER start with an attribute, you must always start with a core hashtag
- Always try to predict an attribute if possible

You must return your result as a JSON record with the fields 'predicted' and 'reasoning', each of type string.
It's quite long (the above has been truncated), but encapsulates the HXL standard.
Another advantage of the direct prompting method is that we can also ask the LLM to provide its reasoning when predicting HXL. This can of course include hallucination, but I have always found it useful for refining prompts.
For the user prompt, we will use the same information that we used for fine-tuning, including the data excerpt and LLM-generated table summary …
What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC Appeals Data for South Sudan8.csv';
dataset_description='The dataset contains information on various
                     appeals and events related to South Sudan,
                     including details such as the type of appeal,
                     status, sector, amount requested and funded,
                     start and end dates, as well as country-specific
                     information like country code, region, and average
                     household size. The data includes appeals for
                     different crises such as floods, population
                     movements, cholera outbreaks, and Ebola preparedness,
                     with details on beneficiaries and confirmation needs.
                     The dataset also includes metadata such as IDs,
                     names, and translation modules for countries and regions.';
column_name:'aid';
examples: ['18401', '17770', '17721', '16858', '15268', '15113', '14826', '14230', '12788', '9286', '8561']
Putting it all together, and prompting both GPT-4o-mini and GPT-4o for comparison …
def call_gpt(prompt, system_prompt, model, temperature, top_p, max_tokens):
    """
    Calls the GPT model to generate a response based on the given prompt and system prompt.

    Args:
        prompt (str): The user's input prompt.
        system_prompt (str): The system's input prompt.
        model (str): The name or ID of the GPT model to use.
        temperature (float): Controls the randomness of the generated output. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic.
        top_p (float): Controls the diversity of the generated output. Higher values (e.g., 0.8) make the output more diverse, while lower values (e.g., 0.2) make it more focused.
        max_tokens (int): The maximum number of tokens to generate in the response.

    Returns:
        dict or None: The generated response as a dictionary object, or None if an error occurred during generation.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    result = result.replace("```json", "").replace("```", "")
    try:
        result = json.loads(result)
        result["predicted"] = result["predicted"].replace(" ", "")
    except Exception:
        print(result)
        result = None
    return result
def make_prompt_predictions(prompts, model, temperature=0.1, top_p=0.1,
                            max_tokens=2000, debug=False, actual_field="actual"):
    """
    Generate predictions for a given set of prompts using the specified model.

    Args:
        prompts (pandas.DataFrame): A DataFrame containing the prompts to generate predictions for.
        model (str): The name of the model to use for prediction.
        temperature (float, optional): The temperature parameter for the model's sampling. Defaults to 0.1.
        top_p (float, optional): The top-p parameter for the model's sampling. Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens to generate for each prompt. Defaults to 2000.
        debug (bool, optional): Whether to print debug information during prediction. Defaults to False.
        actual_field (str, optional): The name of the column in the prompts DataFrame that contains the actual values. Defaults to "actual".

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the predictions, including the prompt, actual value, predicted value, and reasoning.
    """
    num_prompts = len(prompts)
    print(f"Number of prompts: {num_prompts}")
    results = []
    for index, p in prompts.iterrows():
        if index % 50 == 0:
            print(f"{index/num_prompts*100:.2f}% complete")
        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        prompt = prompt[1]["content"]
        actual = p[actual_field]
        result = call_gpt(prompt, hxl_prompt, model, temperature, top_p, max_tokens)
        if result is None:
            print(" !!!!! No LLM result")
            predicted = ""
            reasoning = ""
        else:
            predicted = result["predicted"]
            reasoning = result["reasoning"]
        if debug is True:
            print(f"Actual: {actual}; Predicted: {predicted}; Reasoning: {reasoning}")
        results.append({
            "prompt": prompt,
            "actual": actual,
            "predicted": predicted,
            "reasoning": reasoning
        })
    results = pd.DataFrame(results)
    print(f"\n\n===================== {model} Results =========================\n\n")
    output_prediction_metrics(results)
    print("\n\n=================================================================")
    results["match"] = results["predicted"] == results["actual"]
    results.to_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-{model}-results.xlsx", index=False)
    return results
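The `output_prediction_metrics` helper called above isn't shown in this article. As a rough illustration of what it measures, here is a minimal sketch (an assumed implementation, reporting accuracy only, whereas the real helper also prints precision, recall and F1) that scores core hashtags and full tag+attribute strings separately:

```python
import pandas as pd

def output_prediction_metrics(results, predicted_field="predicted"):
    """Report accuracy for core hashtags alone and for full tag+attribute strings."""
    core = lambda s: str(s).split("+")[0]
    # Core hashtag match, e.g. "#adm1+code" vs "#adm1+name" still counts
    tag_acc = (results[predicted_field].map(core) == results["actual"].map(core)).mean()
    # Exact match of the full tag+attribute string
    full_acc = (results[predicted_field] == results["actual"]).mean()
    print(f"Just HXL tags ... Accuracy: {tag_acc:.2f}")
    print(f"Tags and attributes ... Accuracy: {full_acc:.2f}")
    return tag_acc, full_acc

# Toy example with hypothetical predictions
results = pd.DataFrame({
    "actual":    ["#adm1+code", "#date+year", "#affected+num"],
    "predicted": ["#adm1+code", "#date",      "#affected+killed+num"],
})
tag_acc, full_acc = output_prediction_metrics(results)
```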
for model in ["gpt-4o-mini", "gpt-4o"]:
    print(f"Model: {model}")
    results = make_prompt_predictions(X_test, model, temperature=0.1, top_p=0.1, max_tokens=2000)
We get …
===================== gpt-4o-mini Results =========================
LLM results for predicted, 458 predictions ...
Just HXL tags ...
Accuracy: 0.77
Precision: 0.83
Recall: 0.77
F1: 0.77
Tags and attributes with predicted ...
Accuracy: 0.53
Precision: 0.54
Recall: 0.53
F1: 0.5
===================== gpt-4o Results =========================
LLM results for predicted, 458 predictions ...
Just HXL tags ...
Accuracy: 0.86
Precision: 0.86
Recall: 0.86
F1: 0.85
Tags and attributes with predicted ...
Accuracy: 0.71
Precision: 0.7
Recall: 0.71
F1: 0.69
=================================================================
As a reminder, the fine-tuned model produced the following results …
Just HXL tags ...
Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82
Tags and attributes with predicted ...
Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57
How does prompting-only GPT-4o compare with GPT-4o-mini?
Looking at the above, we see that GPT-4o-mini prompting-only predicts just tags with 77% accuracy, which is less than GPT-4o-mini fine-tuning (83%) and GPT-4o prompting-only (86%). That said, the performance is still good and would improve HXL coverage even if used as-is.
How does prompting-only compare with the fine-tuned model?
GPT-4o prompting-only gave the best results of all models, with 86% accuracy on tags and 71% on tags and attributes. In fact, the performance might well be higher after a bit more analysis of the test data to correct incorrectly human-labeled tags.
Let’s take a closer look at the times GPT-4o got it wrong …
df = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-gpt-4o-results.xlsx")
breaks = df[df["match"] == False]
print(breaks.shape)
for index, row in breaks.iterrows():
    print("\n======================================== ")
    pprint.pp(f"\nPrompt: {row['prompt']}")
    print()
    print("Actual", row["actual"])
    print("Predicted", row["predicted"])
    print()
    pprint.pp(f'Reasoning: \n{row["reasoning"]}')
('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; column_name:'dtype.id'; "
 "examples: ['12', '5', '1', '1', '12', '12', '1', '6', '1', '1', '7']")
Actual #cause+id
Predicted #meta+id
('Reasoning: \n'
 "The column 'dtype.id' contains numeric identifiers (e.g., '12', '5', '1') "
 'which are likely to be internal identifiers for data records. According to '
 'the HXL standard, the appropriate hashtag for internal identifiers is '
 "'#meta' with the attribute '+id'.")
========================================
('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; "
 "column_name:'dtype.name'; examples: ['Flood', 'Population Movement', "
 "'Epidemic', 'Epidemic', 'Flood', 'Flood', 'Epidemic', 'Complex Emergency', "
 "'Epidemic', 'Epidemic', 'Civil Unrest']")
Actual #cause+name
Predicted #event+type
('Reasoning: \n'
 "The examples provided in the column ('Flood', 'Population Movement', "
 "'Epidemic', 'Complex Emergency', 'Civil Unrest') describe different types of "
 'events or incidents within a crisis or emergency. According to the HXL '
 'standard, the appropriate hashtag for this type of data is #event, and the '
 'attribute +type is used to specify the type or category of the event.')
========================================
('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; "
 "column_name:'status_display'; examples: ['Active', 'Active', 'Closed', "
 "'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', "
 "'Closed']")
Actual #status+name
Predicted #status+code
('Reasoning: \n'
 "The column 'status_display' contains values such as 'Active' and 'Closed', "
 'which describe the status of appeals or events. The appropriate HXL hashtag '
 'for project or activity status is #status. Since the values are categorical '
 'and represent different statuses, the attribute +code is suitable to '
 'indicate these status codes.')
========================================
('\n'
 'Prompt: What are the HXL tags and attributes for a column with these '
 'details? '
 "resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
 "Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
 'contains information on various appeals and events related to South Sudan, '
 'including details such as the type of appeal, status, sector, amount '
 'requested and funded, start and end dates, as well as country-specific '
 'information like country code, region, and average household size. The data '
 'includes appeals for different crises such as floods, population movements, '
 'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
 'confirmation needs. The dataset also includes metadata such as IDs, names, '
 "and translation modules for countries and regions.'; "
 "column_name:'region.id'; examples: ['0', '0', '0', '0', '0', '0', '0', '0', "
 "'0', '0', '0']")
Actual #adm1+code
Predicted #region+id
('Reasoning: \n'
 "The column 'region.id' contains numeric identifiers for regions, which "
 'aligns with the HXL tag #region and the attribute +id. The examples provided '
 'are all numeric, indicating that these are likely unique identifiers for '
 'regions.')
========================================
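One simple way to see which disagreements dominate (my own addition, not part of the original analysis) is to tally actual → predicted pairs across the mismatches:

```python
from collections import Counter

# Hypothetical sample of (actual, predicted) pairs from failure cases like
# those above; in practice these would come from the 'breaks' DataFrame.
mismatches = [
    ("#cause+id", "#meta+id"),
    ("#cause+name", "#event+type"),
    ("#status+name", "#status+code"),
    ("#adm1+code", "#region+id"),
    ("#cause+id", "#meta+id"),
]
pair_counts = Counter(mismatches)
for (actual, predicted), n in pair_counts.most_common(3):
    print(f"{actual} -> {predicted}: {n}")
```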
Notice how we now have a ‘Reasoning’ field to indicate why the tags were chosen. This is useful, and will also be an important part of refining the prompt to improve performance.
Looking at the sample above, we see some familiar scenarios that were found when analyzing the fine-tuned model’s failed predictions …
- +id and +code ambiguity
- #region and #adm1 used interchangeably
- #event versus more detailed tags like #cause
These seem to fall into the category where two tags are possible for a given column given their HXL definition. But there are some real discrepancies which would need more investigation.
That said, using GPT-4o to predict HXL tags and attributes yields the best results, and I believe at an acceptable level given that a lot of data is missing HXL metadata altogether and many of the datasets which have it use incorrect tags and attributes.
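If we wanted a more forgiving score for these ambiguous cases, one option (a sketch under my own assumptions, not part of the original evaluation) is to normalize known-ambiguous tags and attributes before comparing:

```python
# Treat known-ambiguous pairs as equivalent when scoring. The equivalence
# map below is illustrative only, not an official part of the HXL standard.
EQUIVALENT = {"+id": "+code", "#region": "#adm1"}

def normalize(tag):
    """Map ambiguous tags/attributes onto a canonical form for comparison."""
    parts = tag.split("+")
    core = EQUIVALENT.get(parts[0], parts[0])
    attrs = sorted(EQUIVALENT.get("+" + a, "+" + a) for a in parts[1:])
    return core + "".join(attrs)

print(normalize("#region+id"))  # prints "#adm1+code"
```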
Let’s see how costs compare for each technique and model …
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """
    Returns the number of tokens in a text string using tiktoken.
    See: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    Args:
        string (str): The text string to count the tokens for.
        encoding_name (str): The name of the encoding to use.

    Returns:
        num_tokens: The number of tokens in the text string.
    """
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
def calc_costs(data, model, method="prompting"):
    """
    Calculate token costs for a given dataset, method and model.
    Note: Only for inference costs, not fine-tuning

    Args:
        data (pandas.DataFrame): The data to get the tokens for.
        method (str, optional): The method to use. Defaults to "prompting".
        model (str): The model to use, eg "gpt-4o-mini"

    Returns:
        input_tokens: The number of input tokens.
        output_tokens: The number of output tokens.
    """
    # See https://openai.com/api/pricing/
    cost = {
        "gpt-4o-mini": {
            "input": 0.150,
            "output": 0.600
        },
        "gpt-4o": {
            "input": 5.00,
            "output": 15.00
        }
    }
    input_tokens = 0
    output_tokens = 0
    for index, p in data.iterrows():
        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        input = prompt[1]["content"]
        # If prompting, we must include the system prompt
        if method == "prompting":
            input += " " + hxl_prompt
        output = p["Corrected actual"]
        input_tokens += num_tokens_from_string(str(input), "cl100k_base")
        output_tokens += num_tokens_from_string(str(output), "cl100k_base")
    input_cost = input_tokens / 1000000 * cost[model]["input"]
    output_cost = output_tokens / 1000000 * cost[model]["output"]
    print(f"\nFor {data.shape[0]} table columns where we predicted HXL tags ...")
    print(f"{method} prediction with model {model}, {input_tokens} input tokens = ${input_cost}")
    print(f"Fine-tuning prediction GPT-4o-mini {output_tokens} output tokens = ${output_cost}\n")
hxl_prompt = generate_hxl_standard_prompt(HXL_SCHEMA_LOCAL_FILE, debug=False)
X_test2 = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-fine-tune-prediction-results-review.xlsx", sheet_name=0)
calc_costs(X_test2, method="fine-tuning", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o")
Which gives …
For 458 table columns where we predicted HXL tags ...
fine-tuning prediction with model gpt-4o-mini, 99738 input tokens = $0.014960699999999999
Fine-tuning prediction GPT-4o-mini 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o-mini, 2688812 input tokens = $0.4033218
Fine-tuning prediction GPT-4o-mini 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o, 2688812 input tokens = $13.44406
Fine-tuning prediction GPT-4o-mini 2001 output tokens = $0.030015000000000003
Note: the above covers only inference costs; there would be a very small additional cost in generating table data summaries with GPT-3.5.
Given the test set, predicting HXL for 458 columns …
Fine-tuning:
As expected, inference costs for the fine-tuned GPT-4o-mini model (which cost about $7 to fine-tune) are very low, about $0.02.
Prompting-only:
- GPT-4o prompting-only is expensive, because of the HXL standard being passed in the system prompt every time, and comes out at $13.44.
- GPT-4o-mini, albeit with reduced performance, is a more reasonable $0.40.
So ease of use comes with a cost if using GPT-4o, but GPT-4o-mini is an attractive alternative.
Finally, it’s worth noting that in many cases setting HXL tags doesn’t need to be real time, for example for a crawler process that corrects already-uploaded datasets. This would mean that the new OpenAI batch API could be used, reducing costs by 50%.
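As a sketch of what batching could look like (the request shape follows the OpenAI Batch API's JSONL format; the `custom_id` values and prompts here are illustrative), each column prediction becomes one JSONL line:

```python
import json

def make_batch_line(custom_id, system_prompt, user_prompt, model="gpt-4o-mini"):
    """Build one JSONL line targeting /v1/chat/completions via the Batch API."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "response_format": {"type": "json_object"},
        },
    })

line = make_batch_line("col-0", "You are an HXL tagging assistant.",
                       "What are the HXL tags for this column?")
print(line)
```

Each line is written to a `.jsonl` file, uploaded, and submitted as a batch job; results arrive within 24 hours at half the price of synchronous calls.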
Putting this all together, I created a GitHub gist hxl_utils.py. Grab this from GitHub and place the file in your current working directory.
Let’s download a file to test it with …
# See HDX for this file: https://data.humdata.org/dataset/sudan-acled-conflict-data
DATAFILE_URL = "https://data.humdata.org/dataset/5efad450-8b15-4867-b7b3-8a25b455eed8/resource/3352a0d8-2996-4e70-b618-3be58699be7f/download/sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx"
local_data_file = f"{LOCAL_DATA_DIR}/{DATAFILE_URL.split('/')[-1]}"

# Save data file locally
urllib.request.urlretrieve(DATAFILE_URL, local_data_file)

# Read it to get a dataframe
df = pd.read_excel(local_data_file, sheet_name=1)
And using this dataframe, let’s predict HXL tags …
from hxl_utils import HXLUtils

hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o")
data = hxl_utils.add_hxl(df, "sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")
print("\n\nAFTER: \n\n")
display(data)
And there we have it, some lovely HXL tags!
Let’s see how well GPT-4o-mini does …
hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o-mini")
data = hxl_utils.add_hxl(df, "sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")
Which gives …
Pretty good! GPT-4o gave “#affected+killed+num” for the last column, where GPT-4o-mini gave “#affected+num”, but this could likely be resolved with some deft prompt engineering.
Admittedly this wasn’t a particularly challenging dataset, but it was able to correctly predict tags for events and fatalities, which are less common than location and dates.
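For readers unfamiliar with how HXL looks in practice, here is a small illustration (my own toy data, not the gist’s actual output) of the convention: the tag row sits directly beneath the column headers:

```python
import pandas as pd

# Toy table resembling the Sudan fatalities data; the HXL convention places
# a row of tags immediately under the header row.
df = pd.DataFrame({
    "Month-Year": ["2023-01", "2023-02"],
    "Fatalities": [10, 12],
})
hxl_row = pd.DataFrame([{
    "Month-Year": "#date+month",
    "Fatalities": "#affected+killed+num",
}])
tagged = pd.concat([hxl_row, df], ignore_index=True)
print(tagged)
```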
I think a big takeaway here is that the direct-prompting technique produces good results without the need for training. Yes, it is more expensive for inference, but maybe not once you factor in a data scientist curating incorrectly human-labeled fine-tuning data. It will depend on the organization and the metadata use-case.
Here are some areas that might be considered in future work …
Improved test data
This analysis did a quick review of the test set to correct HXL tags that were incorrect in the data or had multiple possible values. More time could be spent on this; as always in machine learning, ground truth is key.
Prompt engineering and hyperparameter tuning
The above analysis uses very basic prompts with no real engineering or techniques applied; these could definitely be improved for better performance. With an evaluation set and a framework such as Promptflow, prompt variants could be tested. Additionally, we might add more context data, for example for deciding administrative levels, which can vary per country. Finally, we have used fixed hyperparameters for temperature and top_p, as well as completion token length. All of these could be tuned, leading to better performance.
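Such a sweep might be organized as a simple grid search; in this sketch the `evaluate` function is a hypothetical stand-in for re-running the prediction pipeline and returning accuracy:

```python
import itertools

def evaluate(temperature, top_p):
    """Hypothetical stand-in: would re-run predictions and return accuracy."""
    return 1.0 - abs(temperature - 0.1) - abs(top_p - 0.1)

# Small grid over sampling hyperparameters
grid = list(itertools.product([0.0, 0.1, 0.5], [0.1, 0.9]))
best = max(grid, key=lambda tp: evaluate(*tp))
print(best)  # the best-scoring (temperature, top_p) pair
```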
Cost optimization
The prompting-only approach definitely looks to be a strong option and simplifies how an organization can automatically set HXL tags on their data using GPT-4o. There are of course cost implications with this model, it being more expensive, but predictions occur only on low-volume schema changes, not when the underlying data itself changes, and with new options for batch submission on OpenAI and ever-decreasing LLM costs, this approach looks viable for many organizations. GPT-4o-mini also performs well and is a fraction of the cost.
Application to other metadata standards
It would be interesting to apply this approach to other metadata and labeling standards; I’m sure many organizations are already using LLMs for this.
Please like this article if inclined, and I’d be delighted if you followed me! You can find more articles here.