PhenCards Project (2019)

For this project, I used PySpark to merge databases from several sources and Docker to build images of the project. Here is an overview of the data sources I used:

| Index | External database name | Source link | Instructions | Comments |
|---|---|---|---|---|
| 1 | ICD-10 | ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2020/icd10cm_order_2020.txt | No permission required | Use the link to download |
| 2 | ICD-9 | ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD-9/ucod.txt | No permission required | Use the link to download |
| 3 | ICD-O | https://github.com/philipsales/icdoncology-3/blob/master/icd-oncology.v3.json | No permission required | Use the link to download |
| 4 | SNOMEDCT_US | https://download.nlm.nih.gov/umls/kss/IHTSDO20200131/SnomedCT_InternationalRF2_PRODUCTION_20200131T120000Z.zip | If account/password is needed, use the following: / | Account information is needed for download permission; around 500 MB |
| 5 | UMLS | https://download.nlm.nih.gov/umls/kss/2019AB/umls-2019AB-full.zip | If account/password is needed, use the following: / | Account information is needed for download permission; around 4 GB. A useful guide for downloading with credentials on a cluster: https://askubuntu.com/questions/29079/how-do-i-provide-a-username-and-password-to-wget |
| 6 | MeSH | ftp://nlmpubs.nlm.nih.gov/online/mesh/MESH_FILES/xmlmesh/desc2020.gz | No permission required | Use the link to download |
| 7 | DOID | http://purl.obolibrary.org/obo/doid.owl | No permission required | This is an OWL (RDF/XML) file with HTML-like markup rather than plain text; parsing needs attention |
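
Because the DOID file is tag-based markup rather than delimited text, it needs a dedicated parsing step before it can be merged with the other sources. The following is a minimal sketch, not the project's actual code, assuming the file is serialized as RDF/XML with `owl:Class` elements that carry an `rdf:about` IRI and an `rdfs:label`. The function name `parse_owl_classes` and the sample records are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs used by OWL files serialized as RDF/XML.
NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]


def parse_owl_classes(xml_text):
    """Return (term_id, label) pairs for every labeled owl:Class (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for cls in root.iter("{%s}Class" % NS["owl"]):
        iri = cls.get(RDF_ABOUT)
        label = cls.find("rdfs:label", NS)
        if iri is None or label is None or not label.text:
            continue  # skip anonymous or unlabeled classes
        # Turn ".../obo/DOID_0000001" into "DOID:0000001".
        term_id = iri.rsplit("/", 1)[-1].replace("_", ":", 1)
        rows.append((term_id, label.text.strip()))
    return rows


# Tiny hand-written sample for illustration only (not real DOID content).
SAMPLE = """<rdf:RDF xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0000001">
    <rdfs:label>example disease</rdfs:label>
  </owl:Class>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0000002"/>
</rdf:RDF>
"""

rows = parse_owl_classes(SAMPLE)
print(rows)
```

For the downloaded file, the same loop could run over `ET.parse("doid.owl").getroot()` instead of the inline sample. The resulting pairs are flat rows, which makes them easier to combine with the tabular sources.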