Creating Data Pipelines with the Wrong Data

Tori Armstrong
Apr 29, 2021 · 5 min read


As part of a team of data scientists, I was tasked with extracting relevant data fields from asylum case decisions using spaCy's natural language processing models to populate a PostgreSQL database.

The Stakeholder’s Vision

To qualify for asylum in the United States, an applicant must establish a fear of persecution in their home country on account of at least one of five protected grounds: political opinion, race, religion, nationality, or membership in a particular social group. Immigration judges have a fair amount of discretion, so the chances of getting approved vary greatly depending on the court's location and the judge assigned to the case. In the U.S., there are 58 immigration courts with about 465 immigration judges. Every judge brings their own perspective and biases to the bench, making it difficult for immigration lawyers to prepare their cases. There is currently no centralized database of asylum case data for immigration lawyers; instead, lawyers rely on location-specific Facebook groups to discuss their cases and learn about others' experiences with a particular judge.

Human Rights First’s, a non-profit organization, pro bono legal representation program matches lawyers to asylum-seekers who need help and cannot otherwise afford a high-quality lawyer. HRF requested a tool that serves as a single location for lawyers to upload their asylum decisions and empower them with the ability to explore, filter, and visualize case data as they see fit to best discover judge biases and inform policy recommendations.

Data Science Workflow

The team was assigned to this project five months after its conception, with the database architecture already decided and a proof of concept deployed. It consisted of 11 data scientists, 4 front-end developers, and 5 back-end developers. PDF files are uploaded through the front-end and converted to text with Pytesseract's optical character recognition engine. Using NLP, the text is then analyzed for the relevant data fields specified by the HRF stakeholder. The structured data is converted to a JSON object, stored in an S3 bucket, and uploaded to an Amazon Web Services PostgreSQL database. The front-end displays the data in a dashboard built with Plotly, allowing users to visualize key metrics as they see fit.
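For a concrete picture, here is a minimal sketch of that flow in Python. The field-extraction logic, the bucket name, and the spaCy model are placeholders, not the project's actual code:

```python
import json

import boto3
import pytesseract
import spacy
from pdf2image import convert_from_path

# Requires the spaCy English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def pdf_to_text(pdf_path):
    """OCR every page of a decision PDF with Pytesseract."""
    pages = convert_from_path(pdf_path)  # render each page as a PIL image
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def extract_fields(text):
    """Stand-in extraction: grab the first place name spaCy recognizes.
    The real pipeline ran one function per stakeholder-specified field."""
    doc = nlp(text)
    places = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    return {"location": places[0] if places else None}

def process_case(pdf_path, case_id):
    """OCR -> NLP -> JSON -> S3, mirroring the pipeline described above."""
    fields = extract_fields(pdf_to_text(pdf_path))
    boto3.client("s3").put_object(
        Bucket="hrf-case-data",  # placeholder bucket name
        Key=f"cases/{case_id}.json",
        Body=json.dumps(fields),
    )
```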

The Problem

My fellow data scientists and I were tasked with implementing an accurate, automated system for scraping the case data and uploading it to the back-end. The previous DS team had implemented some light scraping functionality; however, the logic relied on matching exact phrases within the text. This resulted in poor accuracy, as the format and verbiage of judges' decisions are not standard across immigration courts.
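To see why exact matching is so brittle, consider two ways a judge might word the same outcome. The phrases below are illustrative, not taken from real decisions:

```python
import re

text_a = "The application for asylum is DENIED."
text_b = "Respondent's asylum application is denied."

# Roughly the previous team's approach: match one exact phrase.
# It fires on the first wording and silently misses the second.
"application for asylum is DENIED" in text_a  # True
"application for asylum is DENIED" in text_b  # False

# A looser pattern tolerates reordering and case differences.
pattern = re.compile(r"asylum.{0,40}?denied|denied.{0,40}?asylum",
                     re.IGNORECASE | re.DOTALL)
bool(pattern.search(text_a)), bool(pattern.search(text_b))  # (True, True)
```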

While getting familiar with our test data, we quickly realized the dataset consisted only of appellate cases. Asylum decisions are appealed to the Board of Immigration Appeals, a single federal body, so our entire dataset came from one place and shared the exact same format. Initial decisions, by contrast, are issued by the individual immigration courts and are not available through a centralized, public source, so aside from a few initial decisions our stakeholder could share, there are potentially 58 other formats our scraper has never seen.

Left: title page of an initial asylum decision (Chicago). Right: title page of an appellate decision (federal).

Data Extraction Methods

Our scraper was already producing low accuracy, and that was with only a single document format. Each member of the DS team was assigned 1–2 data extraction methods, either to improve upon the previous team's work or to build the function from scratch.

I was tasked with scraping the city and state of each case. The stakeholder stressed the importance of returning the correct city, as states such as California have 9 different immigration courts. Because our test data consisted only of appeals, my initial approach, a regex match on a "city, state" pattern, would always return the appellate court's location, since it is both the first location in the text and the most frequent. To isolate the location of the initial decision, I scraped all of the immigration court locations from the Executive Office for Immigration Review website using Beautiful Soup. Since court locations rarely change, I saved the scraped data into a JSON file in our repository for faster look-ups.
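A sketch of that approach is below. The EOIR URL and the page structure are assumptions (the listing page has moved around over the years), and the regex is simplified. The BIA sits in Falls Church, VA, which is why that location gets skipped:

```python
import json
import re

import requests
from bs4 import BeautifulSoup

# Placeholder URL; check the current EOIR site for the court listing page.
EOIR_URL = "https://www.justice.gov/eoir/immigration-court-operational-status"

CITY_STATE = re.compile(r"[A-Z][A-Za-z. ]+,\s*[A-Z]{2}\b")

def scrape_court_locations(path="court_locations.json"):
    """Pull 'City, ST' strings off the EOIR page and cache them as JSON.
    Court locations rarely change, so a committed file beats a live request."""
    soup = BeautifulSoup(requests.get(EOIR_URL).text, "html.parser")
    courts = sorted({m.group() for m in CITY_STATE.finditer(soup.get_text())})
    with open(path, "w") as f:
        json.dump(courts, f, indent=2)
    return courts

def find_initial_court(text, courts):
    """Return the first known court location in the decision text,
    skipping the appellate board's address (Falls Church, VA)."""
    for match in CITY_STATE.finditer(text):
        location = match.group()
        if location in courts and not location.startswith("Falls Church"):
            return location
    return None
```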

Addressing the Dataset

As my teammates and I added to the codebase, we needed a way to check whether we were actually improving the scraper's accuracy. At first, we were reading each document ourselves to determine the correct answer and comparing it to the function's output. To speed up this process, I organized my teammates to read through 15 documents each and "manually extract" the data into an Excel file. This task took us about two days, but it greatly improved our ability to check our functions' performance. The process also helped us identify edge cases to raise with the stakeholder, such as which gender to return when a decision covers more than one person, or how to handle important precedents.
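With that file in hand, checking a function's accuracy becomes a few lines of pandas. The file and column names below are illustrative:

```python
import pandas as pd

# Illustrative file and column names; each row is one hand-labeled document.
labels = pd.read_excel("manual_extraction.xlsx")

def field_accuracy(field, extractor):
    """Fraction of documents where an extraction function matches the label."""
    hits = 0
    for _, row in labels.iterrows():
        with open(row["text_file"]) as f:  # OCR text saved per document
            predicted = extractor(f.read())
        hits += predicted == row[field]
    return hits / len(labels)

# e.g. field_accuracy("location", lambda text: find_initial_court(text, courts))
```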

Results

Though the scraper is far from perfect, my team and I made a lot of progress in 4 weeks. We automated 8 of the 16 data fields, some of which reach 97% accuracy on appellate cases. Our manual scraping efforts not only helped us validate our outputs; the Excel file will also be extremely beneficial to the next team that works on this project, letting them quickly familiarize themselves with the dataset and offering them the potential to build models. We ensured our code was well documented and that they would know about the issues with our dataset right away. Not having initial decision data is the greatest challenge of this project and will continue to be. We have no idea whether our extraction methods will succeed on initial immigration court decisions, which diminishes the reliability of the entire application. Immigration lawyers will be using this tool to help people in vulnerable situations be granted asylum, and it is crucial that we deliver them accurate data. However, our progress on the accuracy of scraping appeal cases should allow the next team to focus on creating dummy data for initial immigration decisions and begin finding ways to extract the data from differing formats.

Takeaways

This project reinforced how much I enjoy working on teams, even fully remote ones. My teammates and I were in constant communication, always bouncing ideas off each other and sharing our code daily. Every day, one of my colleagues taught me something new. Working closely with the stakeholder made me realize how devoted I am to making someone else's vision come to fruition. Lastly, it reminded me of the importance of test data: no matter how tedious, manually extracting and labelling your target data from hundreds of documents will almost always be necessary when building something that has never existed before.
