DaxServer
File:Marlboro Advance Perforated Holes Filter (India).png
- This file is a copyright violation because it comes from: https://www.reddit.com/r/todayilearned/comments/evmn66/til_that_light_cigarettes_are_designed_to_fool
That Reddit post links to the wiki page on which this image was placed; Reddit crawled the wiki to fetch the image, not vice versa.
Please nominate files for deletion with due diligence, to avoid deleting valid images and wasting editors' time restoring them. Thank you, User4edits (talk) 05:19, 1 September 2024 (UTC)
Congratulations! It has bot status now. EugeneZelenko (talk) 14:27, 2 September 2024 (UTC)
- Thank you! -- DaxServer (talk) 14:54, 2 September 2024 (UTC)
Kumar Gandharva Shankara has been listed at Commons:Categories for discussion so that the community can discuss ways in which it should be changed. We would appreciate it if you could go to voice your opinion about this at its entry. If you created this category, please note that the fact that it has been proposed for discussion does not necessarily mean that we do not value your kind contribution. It simply means that one person believes that there is some specific problem with it. If the category is up for deletion because it has been superseded, consider the notion that although the category may be deleted, your hard work (which we all greatly appreciate) lives on in the new category. In all cases, please do not take the category discussion personally. It is never intended as such. Thank you!
Question
Hello Dax! I have a question/request below, but I would first like to know whether it is even feasible:
Brazil's Superior Electoral Court has an enormous CC-4.0 electoral database of candidates' portraits for the Voting Machine, dating from 2004 until now. @Pfcab: has done an enormous job of uploading a large chunk (as of right now, there are 14 736 files in Category:Files from Portal de Dados Abertos do TSE, but there's still a lot more to go). The JPEG database has images of candidates for Presidency, Governor, Senators, Deputies (State and Federal level), Mayor, Council people, Vice-Presidency, Vice-Governor and Vice-Mayor. This deletion request closed as keep, on the grounds that those images are in scope for Commons.
I wonder: would it be possible for a/your bot to upload these files while naming each one like File:2020 LUIZ MARINHO CANDIDATO PREFEITO SP SAO BERNARDO DO CAMPO TSE (250000897682).jpg (YEAR - CANDIDATE - STATE - CITY - TSE - IMAGE ID)? @Sailoratlantis: wrote a long explanation of this database here.
If the bot can't work with the .zip files, the site divulgacandcontas.tse.jus.br has the biographical material (see here for example). Considering that the images are from the same database, could the bot recognize links like this one (23,7 kb jpg) and the name of the candidate, title the file as in the example above, categorize it in the subcategories of Category:TSE electoral portraits by year and maybe in the subcategories of Category:Politicians of Brazil by party?
As a side request: there are several images uploaded with several styles of naming (as you may see here). Could the bot fix and standardize everything, following the example above (created by Pfcab)? It would also be good to look for uploaded files from the same database that are missing the Template:TSE-Dados-Abertos and the category by year.
I don't know more about the technical side, but I hope this was all useful. The only reason I didn't make a work request is that it all seemed more complex than Commons seems to allow. Thank you very much,
Erick Soares3 (talk) 23:24, 3 September 2024 (UTC)
- On the Council people, deputies and senators, to avoid a bloated upload, I'm in favor of only uploading the elected people (the .CSV files from the Portal have all the data). Erick Soares3 (talk) 23:44, 3 September 2024 (UTC)
- Hi @Erick Soares3 Thanks for asking me. Let me look into that and I'll get back to you with my opinions -- DaxServer (talk) 09:09, 4 September 2024 (UTC)
- @Erick Soares3 Here are my opinions, assuming the DR is closed with the decision that the files are properly licensed.
- The naming format is very much possible and easy, as the info exists in the CSV files in the ZIP. It is also possible to categorize under TSE portraits by year and also under the by-party categories. I don't speak Portuguese, so it is not immediately clear which columns in which CSV refer to the political affiliation, but I assume the information is somewhere in there. If all the information provided in the divulgacandcontas.tse.jus.br portal exists in the dataset ZIPs, then it is much easier, as we don't need to collect the information from that portal's API but can just download a bunch of ZIP files and work on them. If not, some research into their API would be required to understand what is what.
- For the existing files, standardising the naming format, adding templates like the TSE template, and categorizing are also possible once the information is collated as above. Some sort of reconciling will surely need to be done one way or the other. I think it's better to do the renaming and updating after organizing the info, so as to avoid any double work. Looking at that category, there seem to be colorizations like this of the original - I guess that is one of the questions that need answers, but these will come up once the work is started.
- The dataset has a ton of biographical information about the candidates. Most of that belongs in Wikidata. So, I see this as a cross-wiki project of very good value - where the uploads go here and the bios go in Wikidata and are linked in the SDC. I'm not sure if there is an existing bot that is already working on this data on Commons and/or Wikidata, but if not, it shouldn't be much of a hassle once someone takes up the work.
- All the images of candidates for all the offices stated above are in scope for Commons. I'd upload all, and not just the elected ones. I'd recommend posting this request at Commons:Batch uploading so that others can chime in as well. Do you know how much of the dataset @Pfcab is planning to upload? If they plan to do all of it, they might already have finished before I or someone else starts working on your request. I am interested as well, although I can only work as time permits and if help is provided with Portuguese. I'd also recommend posting at Wikidata, maybe the Project Chat, about the project and asking for help/opinions/comments. If you need any sort of help from me, feel free to ask.
- Good luck! -- DaxServer (talk) 10:52, 4 September 2024 (UTC)
- Hello Dax! Thank you very much!
- About the parties names, the data is under SG_PARTIDO (Party acronym) and NM_PARTIDO (Party full name);
- Pfcab is only working with the municipal elections: 2004, 2008, 2012, 2016, 2020 and 2024 (he's currently preparing the upload of the 2024 data and, later, 2008). To ease his workload, he's only working with the Mayor and Vice-Mayor candidates, excluding the Council people (since they don't usually receive Wikipedia articles, he doesn't see them as relevant, but as you said, they are in scope for Commons and Wikidata);
- On the deletion request, I think that it is only a matter of an admin closing it as keep, since the licensing is already solved;
- So, in the batch upload request, do I only need to suggest the divulgcand website? At first, I was confused because I had to explain both the TSE Database (with the .zip files) and the divulgcand - but for the request, only one is possible.
- I will look into the Wikidata option! I was also thinking about turning the Category:Files from Portal de Dados Abertos do TSE into a meta category, so the files are subcategorized by year or even "by year, by state". I think that it would be better organized that way, especially if the TSE template is changed for it.
- Thank you again!
- Erick Soares3 (talk) 11:37, 4 September 2024 (UTC)
- Happy to help!
- In the batch upload request, I'd recommend specifying all possible options as alternatives. One, several or all of them can be picked up depending on how the work is carried out. Someone might prefer to work with the CSV dumps rather than the API, others the opposite, and some both.
- For example, at Commons:Batch uploading/U.S. Army Corps of Engineers Digital Visual Library, CONTENTdm was initially suggested for the work. I later discovered the IIIF installation and documented it. A few discussions later, I ended up using CONTENTdm, as JPEGs are being uploaded. If TIFFs were to be uploaded, I'd have used IIIF.
- I think in the end, it would be best to document the possibilities and decide whichever route works. Needless to say, the files will have to be properly licensed. Hope that helps. -- DaxServer (talk) 11:58, 4 September 2024 (UTC)
- @DaxServer Hi! I have personally uploaded 13808 images for Mayor and Deputy Mayor candidates for municipalities with over 100k residents, as these are the notability criteria on ptWikipedia and this was my main goal when first uploading these images. At the moment I've only uploaded the images for 2004, 2012, 2016 and 2020. Therefore, it only covers a small portion of the data and images available. I have the list of file names I've personally uploaded, which may help with the reconciling process. There are also many other images uploaded by other users already available on Commons. I've also documented some of these images (using the criteria explained above) using SHA1 hash queries with the MediaWiki API. I'm currently running a script to check for ALL 2024 images available as of 2024/09/03. I don't know if there is a faster way to find out whether an image has already been uploaded to Commons, as they are all uploaded with different names and may or may not be in the appropriate categories.
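The SHA1 lookup mentioned above could be sketched roughly as follows. This is only an illustration, not the script Pfcab actually runs: the helper names are made up for this example, and it assumes the MediaWiki API's list=allimages query with its aisha1 parameter is used for the lookup. The sketch only builds the request URL and parses a response; the HTTP call itself is left out.

```python
import hashlib
from urllib.parse import urlencode

# Commons API endpoint.
API_URL = "https://commons.wikimedia.org/w/api.php"


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA1 digest of the raw image bytes."""
    return hashlib.sha1(data).hexdigest()


def build_sha1_query(sha1: str) -> str:
    """Build an API URL asking for files whose SHA1 matches (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def titles_from_response(response: dict) -> list:
    """Extract file titles from a decoded allimages JSON response."""
    images = response.get("query", {}).get("allimages", [])
    return [image["title"] for image in images]
```

An empty result from titles_from_response would suggest that a byte-identical file is not yet on Commons; files that were re-encoded, cropped or colorized would not be found this way.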
- I'm willing to help the efforts as I already have familiarity with the data, but I don't have that much experience with Commons and (especially) Wikidata. Here are some points I've realized about the data:
- The .zip files for images from the 2008 elections are currently unusable. Each .zip is supposed to include only candidates for a specific State, but for 2008 there was some mixup and each .zip has images from multiple different States. I believe this happened because the images are only named by the ID of the candidate, and each candidate does not have a unique ID. For example, Candidate A is running for mayor in a city in the state of Rio de Janeiro and Candidate B is running for mayor in a city in the state of São Paulo. They both share the same ID, so if you check the RJ.zip and the SP.zip you will find the images for both candidates. I've already contacted TSE about this issue and they responded saying the problem will be fixed, but did not give any estimate of the time needed to do so. I believe it may be possible to circumvent this problem using the divulgacandcontas.tse.jus.br portal as the source instead of the .zip file. The 2004, 2012, 2016, 2020 and 2024 .zip files are all correct, but I don't know the status for 2006, 2010, 2014, 2018 or 2022.
- The accompanying .csv files have a lot of interesting information for each candidate. Here are some of the fields I find relevant and what they represent:
- NM_UE: Name of the municipality where the candidate is running for office.
- SG_UE: Numerical ID for the municipality where the candidate is running for office. (I'm not sure if this remains the same from election to election)
- SG_UF: Two letter code for the State which the municipality is a part of. (I can provide .json file that maps each two letter code to the full name of the State)
- CD_CARGO: Numerical ID indicating the office the candidate is running for. (mayor = 11, deputy mayor = 12, city council member = 13)
- DS_CARGO: Name of the office the candidate is running for (mayor: prefeito, deputy mayor: vice-prefeito, city council member: vereador)
- SQ_CANDIDATO: Unique identifier for each candidate (Used to name the image files)
- NR_CANDIDATO: Electoral number of the candidate; this is the number you type into the voting machine to select and vote for the candidate.
- NM_CANDIDATO: Full government name of the candidate.
- NM_URNA_CANDIDATO: Name that appears on the electronic voting machine when voting (Usually shorter names or nicknames, see the current president, who goes by Lula)
- NM_SOCIAL_CANDIDATO: Social name for trans candidates, see the government page for an explanation. (Usually left null)
- NM_PARTIDO: Full name of the party the candidate is a part of. (These may change with each election, as people change parties)
- SG_PARTIDO: Shortened name (usually an acronym) of the party the candidate is a part of. (I can provide a .json file that maps each name to the correct category on Commons)
- CD_NACIONALIDADE: Numerical ID indicating the candidate's nationality (native Brazilian = 1, naturalized Brazilian = 2)
- DS_NACIONALIDADE: Description of the nationality (brasileira nata = native Brazilian, brasileira (naturalizada) = naturalized Brazilian)
- SG_UF_NASCIMENTO: Two letter code for the State where the candidate was born.
- NM_MUNICIPIO_NASCIMENTO: Name of the municipality where the candidate was born.
- DT_NASCIMENTO: Date of birth
- DS_GENERO: Gender
- DS_GRAU_INSTRUCAO: Highest level of education achieved
- DS_COR_RACA: Race (white, black, etc)
- DS_OCUPACAO: Job/Occupation (these can range from very specific to pretty generic)
- DS_SIT_TOT_TURNO: Whether the candidate was elected or not.
- All .csv files use "Latin-1" encoding.
- The data on these csv files is very very very dirty, with typos, missing accents etc. It may take a lot of cleaning up to get it all corrected.
- Cheers.
- Pfcab (talk) 14:43, 4 September 2024 (UTC)
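To make the naming discussion concrete, here is a rough sketch of how one row of the .csv could be turned into a title in the format proposed earlier in this thread. It is only an illustration under several assumptions: the function names are hypothetical, the ";" delimiter is assumed rather than confirmed in this thread, the election year is passed in by the caller, and it is a guess that the example title uses the ballot name (NM_URNA_CANDIDATO) rather than the full name. The "Latin-1" decoding and the column names come from the list above.

```python
import csv
import io
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining accents, e.g. 'SÃO' becomes 'SAO'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def read_rows(raw: bytes, delimiter: str = ";") -> list:
    """Decode a Latin-1 CSV and return its rows as dicts (delimiter is an assumption)."""
    text = raw.decode("latin-1")
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def build_title(row: dict, year: int) -> str:
    """Build a title in the YEAR NAME CANDIDATO OFFICE STATE CITY TSE (ID) pattern."""
    parts = [
        str(year),
        row["NM_URNA_CANDIDATO"],  # assumed: ballot name, not NM_CANDIDATO
        "CANDIDATO",
        row["DS_CARGO"],
        row["SG_UF"],
        row["NM_UE"],
        "TSE",
    ]
    text = " ".join(strip_accents(part).upper().strip() for part in parts)
    return f"File:{text} ({row['SQ_CANDIDATO']}).jpg"


# Made-up sample row mirroring the example title from this thread.
sample = (
    'SG_UF;NM_UE;DS_CARGO;SQ_CANDIDATO;NM_URNA_CANDIDATO\n'
    '"SP";"SÃO BERNARDO DO CAMPO";"PREFEITO";"250000897682";"LUIZ MARINHO"\n'
).encode("latin-1")
title = build_title(read_rows(sample)[0], 2020)
```

The sketch does not handle characters that are not allowed in file titles, nor the typos and missing accents in the source data, so a real run would need a cleaning step before this one.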
- Batch request: Commons:Batch uploading/Files from Portal de Dados Abertos do TSE. Erick Soares3 (talk) 15:40, 4 September 2024 (UTC)
- @DaxServer I mistakenly asked the creator of this discussion what he thinks (I didn't notice that it was created by him). He said that the licensing issues have already been fixed. Erick Soares3 (talk) 11:47, 5 September 2024 (UTC)
CuratorBot
Hello DaxServer, this type of error should be corrected; otherwise the Category:Pages using duplicate arguments in template calls will fill up. Regards. -- ato (talk) 21:04, 18 September 2024 (UTC)
- @Ato 01 Thanks for noticing them. I'll check into the errors -- DaxServer (talk) 07:36, 19 September 2024 (UTC)
- I corrected them. Rerunning OpenRefine, all of them should be updated in an hour or so -- DaxServer (talk) 09:36, 19 September 2024 (UTC)
- I think it would be best if you added the category to your watchlist, so that you can see if something goes wrong... -- ato (talk) 05:46, 20 September 2024 (UTC)