Very clearly infeasible SMILES string generated during batch inference #42

@Eason51

When I run batch inference on a list of image files with the method predict_images, clearly invalid SMILES strings occasionally appear in the results. The following is an example:

CC1C(c2ccccc2)=C([Si](C)(C)C)CC1=C123(C)CC456789%10%11%12%13%14%15%16%17%18%19%20%21C%22C%23C4%24CC%2354C5C6C%22%2446%23%25%26%27%28%29%30%31%32%33C57CC684C5C4678C4%22%24(C%34%35%36(C%37%38%39%40C%41%42%43%44(C4%34%37%239C49%23%34C%35%38%41%25%10%37C%10%25%35%38%41C%394%42%26%11%45C%374%11%26%39%42C9%43%10%27%12%37%46C%459%10%12%27%43%47C%23%44%254%28%13%45%48%49C%374%13%23%25%28%44C%34%35%119%29%14%37%50%51%52C%46%459%11%14%29%34%35%53C4%45%46%54%55%56%57%58%59%60%61C%48%374%62%63%64%65%66%67%68C9%37%48%69%70%71%72%73%74%75%76C%49%50%459%77%78%79%80%81%82%83C%38%26%10%13%114%30%15%45%49%50C%51%46%374%10%11%13%15%26%30%38%84C%12%23%14%629%37%46%51%85%86%87%88C%54%489%12%14%23%62%89%90%91%92%93C%29%63%774%48%54%94%95%96%97%98%99C%52%55%69%45%374%29%63%77%(100)%(101)%(102)%(103)C%78%109%37%45%52%55%69%(104)%(105)%(106)%(107)C9%10%78%(108)%(109)%(110)%(111)(C%(112)%(113)%(114)%(115)C9%(116)%(117)%(118)C%(112)%109%(119)C%(113)%(116)%78%10C%(114)%(117)9%(108)C%(115)%(118)%(119)%10%(109))C%1249%10%78%(108)%(109)%(112)%(113)%(114)%(115)%(116)C%14%374%12%(117)%(118)%(119)%(120)%(121)%(122)%(123)%(124)C%64%79%11%23%29%14%37%(125)%(126)%(127)%(128)%(129)C%56%70%49%46%48%459%11%23%29%64%79%(130)C%65%80%13%624%639%45%46%48%49%56C%57%71%51%54%52%14%12%104%13%62%63%65C%89%94%55%379%(117)%78%10%12%14%51%52%54C%58%72%85%95%69%(125)%45%(118)%(108)9%37%55%57C%(110)%15%90%(119)%(109)%77%114%10%45%58%69%70C%(111)%26%91%(120)%(112)%(100)%23%139%124%10%11C%59%73%86%96%(104)%(126)%46%(121)%(113)%14%459%12%13C%41%39%27%25%34%66%81%30%(101)%29%16%14C%60%74%87%97%(105)%(127)%48%(122)%(114)%51%584%15%16C%42%43%28%35%67%82%38%(102)%64%62%379%17C%61%75%50%88%98%(106)%(128)%49%(123)%(115)%14%52%69%10(C%47%44%53%68%83%84%92%(103)%79%63%55%12%15%18)C%76%93%99%(107)%(129)%56%(124)%(116)%(130)%65%57%54%70%11%13%16)C6%22%36%401%31%19)C7%242%32%20)C583%33%21

The invalid predictions always have the same form: a valid-looking SMILES prefix followed by a long run of C%(n) tokens. When I rerun with the same input list of images, the previously invalid SMILES strings become valid, but other, previously valid strings may become invalid in the same way. When I take out the affected images and test each one individually with predict_image_file, the error never occurs. Why are invalid SMILES strings like this not filtered out in postprocessing?
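
As a stopgap on the caller's side, the pattern described above (one atom followed by an implausibly long run of ring-closure labels) can be flagged with a small heuristic, and the flagged images re-run one at a time. The following is only a sketch under stated assumptions: the helper names (max_ring_closures_per_atom, looks_degenerate, repredict_flagged) and the limit of 4 are hypothetical, it is not part of the library's postprocessing, and predict_one stands for any callable you write that maps an image path to a SMILES string (for example a thin wrapper around predict_image_file, whose exact return format is not shown in this issue). It is a crude filter, not a substitute for a real SMILES parser.

```python
import re
from typing import Callable, List

# Tokenizer for the pieces this heuristic cares about: atoms (bracketed or
# organic-subset), ring-closure labels (%(n), %nn, or a single digit), and
# everything else (bonds, branches, dots).
_TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])"
    r"|(?P<ring>%\(\d+\)|%\d{2}|\d)"
    r"|(?P<other>.)"
)


def max_ring_closures_per_atom(smiles: str) -> int:
    """Return the largest number of ring-closure labels written after one atom.

    Labels that appear after a closed branch are attributed to the last atom
    inside that branch, which is good enough for spotting degenerate output.
    """
    worst = current = 0
    for match in _TOKEN.finditer(smiles):
        if match.lastgroup == "atom":
            current = 0
        elif match.lastgroup == "ring":
            current += 1
            worst = max(worst, current)
    return worst


def looks_degenerate(smiles: str, limit: int = 4) -> bool:
    """Flag strings where a single atom carries more than `limit` ring closures.

    The default of 4 is an assumption chosen as a loose valence bound.
    """
    return max_ring_closures_per_atom(smiles) > limit


def repredict_flagged(
    image_paths: List[str],
    smiles_list: List[str],
    predict_one: Callable[[str], str],  # hypothetical single-image predictor
    limit: int = 4,
) -> List[str]:
    """Re-run only the flagged images individually and return a patched copy."""
    fixed = list(smiles_list)
    for i, (path, smi) in enumerate(zip(image_paths, smiles_list)):
        if looks_degenerate(smi, limit):
            fixed[i] = predict_one(path)
    return fixed
```

Applied to the example above, the carbon followed by 456789%10%11... far exceeds the limit and is flagged, while the valid prefix CC1C(c2ccccc2)=C([Si](C)(C)C)CC1 on its own is not. Re-running only the flagged images mirrors the observation that single-image prediction does not show the problem.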
